logo_datactivist           logo_pleiade



Objectif : Analyse économétrique : exploration des données, modélisations et prévisions

Date début de l’analyse : 17 mai 2022

Date de la dernière modification : 13 juillet 2022

# Packages
packages = c("tidyverse", "glue", "kableExtra", "knitr", "plotly", "hrbrthemes", "gridExtra", "stats", "arrow", "lme4", "tidymodels", "broom.mixed", "multilevelmod", "dotwhisker", "rAmCharts", "DescTools", "readxl")

## Installation des packages si besoin et chargement des librairies
package.check <- lapply(packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)}})

# Import BSO complet (2013-2020)
data <- read_csv("../data/4.process/managing-variables_2nd-part/whole_bso.csv", col_types = cols(author_position = "c"), na = c("", "NA"))

I - Modélisations for french AND non french CA articles


Analyse exploratoire


Péparation des données aux modélisations

# Création de nouvelles variables : couleur de journal finale et APC fiables
data <- data %>% mutate(oa_color_finale = case_when(oa_color_article_BSO == "diamond" ~ "diamond",
                                                    oa_color_openalex == "gold" ~ "gold",
                                                    oa_color_openalex == "hybrid" ~ "hybridOA",
                                                    TRUE ~ "other"), 
                        amount_apc_EUR_trusted = case_when(apc_source == "openAPC_estimation_publisher" | apc_source == "openAPC_estimation_publisher_year" ~ NA_real_,
                                                           TRUE ~ amount_apc_EUR))

Pour les modélisations, nous cherchons à expliquer et prédire le montant des APC (Y) avec 2 variables potentielles :

  • amount_apc_EUR : variable du BSO sans modification ;
  • amount_apc_EUR_trusted : montant des APC fiable, c’est-à-dire dont la source est doaj, OpenAPC, openAPC_estimation_issn_year ou openAPC_estimation_issn. Sur les 6 sources d’APC du BSO, “openAPC_estimation_publisher” et “openAPC_estimation_publisher_year” sont écartées.

L’analyse exploratoire nous dira quelle variable des APC est à retenir pour inclure dans les modèles. Dans tous les cas, les variables explicatives utilisées pour prédire Y seront “oa_color_finale”, “tier”, “bso_classification”, “is_french_CA_bso_wos_openalex_single_lang”, “year” et “is_covered_by_couperin”.

Pour préparer les données nous créons donc les variables “oa_color_finale” et “amount_apc_EUR_trusted”, puis nous sélectionnons les 6 variables qui entrent dans les modèles en supprimant les valeurs manquantes, et enfin nous filtrons sur les articles pour lesquels des APC ont été payés (apc_has_been_paid == 1, amount_apc_EUR_trusted > 0).

# Sélection de variables pour les modèles
bso_mvp <- data %>% 
    select(year, bso_classification, apc_has_been_paid, tier, oa_color_finale, is_french_CA_bso_wos_openalex_single_lang, amount_apc_EUR, amount_apc_EUR_trusted, is_covered_by_couperin) %>% 
    mutate(bso_classification = as.factor(bso_classification), 
           apc_has_been_paid = as.factor(apc_has_been_paid), 
           tier = as.factor(tier), 
           oa_color_finale = as.factor(oa_color_finale), 
           # On modifie les valeurs des catégories pour éviter les conflits de nom de classes dans les modélisations
           is_french_CA_bso_wos_openalex_single_lang = as.factor(case_when(is_french_CA_bso_wos_openalex_single_lang == 1 ~ "french CA",
                                                                           is_french_CA_bso_wos_openalex_single_lang == 0 ~ "non french CA")), 
           is_covered_by_couperin = as.factor(case_when(is_covered_by_couperin == 1 ~ "covered",
                                                        is_covered_by_couperin == 0 ~ "non covered")))

# Préparation du jeu de données prêt à modéliser
bso_pop <- bso_mvp %>% 
    filter(!is.na(bso_classification),
           !is.na(oa_color_finale),
           !is.na(is_french_CA_bso_wos_openalex_single_lang),
           !is.na(amount_apc_EUR),
           !is.na(is_covered_by_couperin)) %>% 
    filter(apc_has_been_paid == 1, amount_apc_EUR > 0)

# Fonction pour sauvegarder les viz au format SVG
saving_plot <- function(name, graph) {
  ggsave(file = glue("../figures/SVG_modelisations/{name}.svg"), plot=graph, width=10, height=6)
}


Y moyen par année et par variable catégorielle


Y1 = amount_apc_EUR

Discipline
table_y1_class <- bso_pop %>% group_by(year, bso_classification) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup()

  # Plot
graph <- ggplot(table_y1_class, aes(x = year, y = montant_moyen, group = bso_classification, colour = bso_classification)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par discipline et par année", colour = "") +
  scale_color_manual(values = c("#025bbb", "#9d6939", "#3aaf29", "#f1e401", "#fcb2b8", "#fb6d35", "#284803", "#852895", "#919598", "#24c0e8"),
                     breaks = c("Biology (fond.)", "Medical research", "Chemistry", "Earth, Ecology, \nEnergy and applied biology", "Social sciences", "Computer and \n information sciences", "Physical sciences, Astronomy", "Engineering", "Mathematics", "Humanities")) +
  theme_classic() +
  theme(legend.position = "right", 
        axis.title.x = element_blank())
graph
saving_plot("1.explo_APC-paid_discipline", graph)


Tier
table_y1_tier <- bso_pop %>% group_by(year, tier) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% 
    mutate(tier = as.character(paste("tier", tier, sep = " ")))

  # Plot
graph <- ggplot(table_y1_tier, aes(x = year, y = montant_moyen, group = tier, colour = tier)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par tier et par année", color = "") +
  theme_classic() +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099", "#0072B2", "#D55E00")) +
  theme(legend.position = "right", 
        axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y1_tier %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("2.explo_APC-paid_tier", graph)


French CA
table_y1_ca <- bso_pop %>% 
    group_by(year, is_french_CA_bso_wos_openalex_single_lang) %>% 
    summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% 
    mutate(is_french_CA = as.character(is_french_CA_bso_wos_openalex_single_lang))

  # Plot
graph <- ggplot(table_y1_ca, aes(x = year, y = montant_moyen, group = is_french_CA, colour = is_french_CA)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si le CA est français") +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("non french CA", "french CA")) +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y1_ca %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("3.explo_APC-paid_frenchCA", graph)


Couleur OA
table_y1_color <- bso_pop %>% group_by(year, oa_color_finale) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% mutate(oa_color_finale = as.character(oa_color_finale))

  # Plot
graph <- ggplot(table_y1_color, aes(x = year, y = montant_moyen, group = oa_color_finale, colour = oa_color_finale)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par couleur OA et par année") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#0072B2", "#FFCC00"),
                     breaks = c("hybridOA", "gold")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y1_color %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("4.explo_APC-paid_OAcolor", graph)


Covered by Couperin
table_y1_coup <- bso_pop %>% group_by(year, is_covered_by_couperin) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% 
    mutate(is_covered_by_couperin = case_when(is_covered_by_couperin == "covered" ~ "covered by couperin",
                                              is_covered_by_couperin == "non covered" ~ "not covered by couperin"))

  # Plot
graph <- ggplot(table_y1_coup, aes(x = year, y = montant_moyen, group = is_covered_by_couperin, colour = is_covered_by_couperin)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si l'article est couvert par les accords Couperin") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("covered by couperin", "not covered by couperin")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y1_coup %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("5.explo_APC-paid_covered", graph)


Y2 = amount_apc_EUR_trusted

Discipline
table_y2_class <- bso_pop %>%  group_by(year, bso_classification) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup()

  # Plot
graph <- ggplot(table_y2_class, aes(x = year, y = montant_moyen, group = bso_classification, colour = bso_classification)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par discipline et par année", colour = "") +
  scale_color_manual(values = c("#025bbb", "#9d6939", "#3aaf29", "#f1e401", "#284803", "#fcb2b8", "#fb6d35", "#852895", "#24c0e8", "#919598"),
                     breaks = c("Biology (fond.)", "Medical research", "Chemistry", "Earth, Ecology, \nEnergy and applied biology", "Physical sciences, Astronomy", "Social sciences", "Computer and \n information sciences", "Engineering", "Humanities", "Mathematics")) +
  theme_classic() +
  theme(legend.position = "right", 
        axis.title.x = element_blank())
graph
saving_plot("6.explo_APC-paid_discipline_trusted", graph)


Tier
table_y2_tier <- bso_pop %>% group_by(year, tier) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% 
    mutate(tier = as.character(paste("tier", tier, sep = " ")))

  # Plot
graph <- ggplot(table_y2_tier, aes(x = year, y = montant_moyen, group = tier, colour = tier)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par tier et par année", color = "") +
  theme_classic() +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099", "#0072B2", "#D55E00")) +
  theme(legend.position = "right", 
        axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y2_tier %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("7.explo_APC-paid_tier_trusted", graph)


French CA
table_y2_ca <- bso_pop %>% 
    group_by(year, is_french_CA_bso_wos_openalex_single_lang) %>% 
    summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% 
    mutate(is_french_CA = as.character(is_french_CA_bso_wos_openalex_single_lang))

  # Plot
graph <- ggplot(table_y2_ca, aes(x = year, y = montant_moyen, group = is_french_CA, colour = is_french_CA)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si le CA est français") +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("non french CA", "french CA")) +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y2_ca %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("8.explo_APC-paid_frenchCA_trusted", graph)


Couleur OA
table_y2_color <- bso_pop %>% group_by(year, oa_color_finale) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% mutate(oa_color_finale = as.character(oa_color_finale))

  # Plot
graph <- ggplot(table_y2_color, aes(x = year, y = montant_moyen, group = oa_color_finale, colour = oa_color_finale)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par couleur OA et par année") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#0072B2", "#FFCC00"),
                     breaks = c("hybridOA", "gold")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y2_color %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("9.explo_APC-paid_OAcolor_trusted", graph)


Covered by Couperin
table_y2_coup <- bso_pop %>% group_by(year, is_covered_by_couperin) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% 
    mutate(is_covered_by_couperin = case_when(is_covered_by_couperin == "covered" ~ "covered by couperin",
                                              is_covered_by_couperin == "non covered" ~ "not covered by couperin"))

  # Plot
graph <- ggplot(table_y2_coup, aes(x = year, y = montant_moyen, group = is_covered_by_couperin, colour = is_covered_by_couperin)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si l'article est couvert par les accords Couperin") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("covered by couperin", "not covered by couperin")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table_y2_coup %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("10.explo_APC-paid_covered_trusted", graph)


Au vu des graphiques présentés ci-dessus, nous décidons de ne modéliser que les articles pour lesquels le montant des APC est fiable. Nous supprimons donc la variable des APC du BSO puis filtrons sur les articles ayant un APC fiable connu.

n_old <- nrow(bso_pop)
bso_pop <- bso_pop %>% select(-amount_apc_EUR) %>% filter(!is.na(amount_apc_EUR_trusted))

Le nombre d’articles dans notre base finale pour les modélisations est ainsi de 119616. En ne gardant que les sources d’APC fiables, 28554 articles ont été écartés, soit 19.27% des articles avec des APC payés et des valeurs connues pour les autres variables.


Valeurs extrèmes

amBoxplot(bso_pop$amount_apc_EUR_trusted, xlab=" ", ylab="Montant des APC fiables", main="Boxplot de Y")

L’étendue (écart entre les valeurs minimales et maximales) du montant des APC est importante, avec des valeurs allant de 2.27 à 9848.4 euros. Nous constatons 2 valeurs extrêmes qui se détachent largement des autres sur le boxplot, où les montants sont respectivement de 9848.4 et 9028.43 euros. Après vérification à partir de l’ISSN, les valeurs sont plausibles donc nous décidons de laisser ces valeurs aux données à modéliser.


Normalité de distribution

# Normalité de Y
mean <- mean(bso_pop$amount_apc_EUR_trusted, na.rm = TRUE)   
sd <- sd(bso_pop$amount_apc_EUR_trusted, na.rm = TRUE)  
ggplot(bso_pop, aes(bso_pop$amount_apc_EUR_trusted)) +
      geom_histogram(aes(y=..density..),  color="black", fill = "steelblue", alpha = 0.2) +
    geom_density( color="red", size = 1, adjust = 5) +
    stat_function(fun = dnorm, colour = "Black", size = 1, args = list(mean = mean, sd = sd))  +
    xlim(c(min(bso_pop$amount_apc_EUR_trusted)-1, max(bso_pop$amount_apc_EUR_trusted)+1))+ ggtitle("Distribution du montant des APC fiables") +
  xlab("Montant des APC fiables") + ylab("Densité") +
     theme_classic()

# Test statistique de normalité
out <- JarqueBeraTest(bso_pop$amount_apc_EUR_trusted, robust = FALSE, method = "chisq")  # Y non normal (on rejette H0)
library(formattable)
df <- data.frame(value = unlist(out))
tdf <- as.data.frame(t(df))
formattable(tdf)    
statistic.X-squared parameter.df p.value method data.name
value 49797.6693637252 2 0 Jarque Bera Test bso_pop$amount_apc_EUR_trusted

Le montant des APC fiables n’est pas normalement distribué.


Corrélations et dépendances entre les variables

# Fonction pour sauvegarder les résultats des tests
save_results <- function(output, num){
    df <- data.frame(value = unlist(output))
    tdf <- as.data.frame(t(df))
    assign(glue("tdf_{num}"), tdf[,1:5], envir = .GlobalEnv)
}
# Test de corrélation (Spearman car distribution non normale)
output <- cor.test(bso_pop$amount_apc_EUR_trusted, bso_pop$year, method = "spearman") # corrélation significative
df <- data.frame(value = unlist(output))
tdf <- as.data.frame(t(df))
tdf <- tdf %>% mutate(data.name = str_replace_all(data.name, "_", "\\_"),  #escape special characters
                          data.name = gsub("bso_pop\\$", "", data.name))
formattable(tdf) 
statistic.S p.value estimate.rho null.value.rho alternative method data.name
value 242633003532012 0 0.149384487346854 0 two.sided Spearman’s rank correlation rho amount_apc_EUR_trusted and year
# Tests de dépendance
    # Année
save_results(chisq.test(bso_pop$year, bso_pop$bso_classification), "1")
save_results(chisq.test(bso_pop$year, bso_pop$tier), "2")
save_results(chisq.test(bso_pop$year, bso_pop$oa_color_finale), "3")
save_results(chisq.test(bso_pop$year, bso_pop$is_french_CA_bso_wos_openalex_single_lang), "4")
save_results(chisq.test(bso_pop$year, bso_pop$is_covered_by_couperin), "5")
    # Discipline
save_results(chisq.test(bso_pop$bso_classification, bso_pop$tier), "6")
save_results(chisq.test(bso_pop$bso_classification, bso_pop$oa_color_finale), "7")
save_results(chisq.test(bso_pop$bso_classification, bso_pop$is_french_CA_bso_wos_openalex_single_lang), "8")
save_results(chisq.test(bso_pop$bso_classification, bso_pop$is_covered_by_couperin), "9")
    # Tier
save_results(chisq.test(bso_pop$tier, bso_pop$oa_color_finale), "10")
save_results(chisq.test(bso_pop$tier, bso_pop$is_french_CA_bso_wos_openalex_single_lang), "11")
save_results(chisq.test(bso_pop$tier, bso_pop$is_covered_by_couperin), "12")
    # Couleur OA
save_results(chisq.test(bso_pop$oa_color_finale, bso_pop$is_french_CA_bso_wos_openalex_single_lang), "13")
save_results(chisq.test(bso_pop$oa_color_finale, bso_pop$is_covered_by_couperin), "14")
    # French CA
save_results(chisq.test(bso_pop$is_french_CA_bso_wos_openalex_single_lang, bso_pop$is_covered_by_couperin), "15") #toutes dépendantes
# Présentation des résultats dans une table
output <- rbind(tdf_1,tdf_2,tdf_3,tdf_4,tdf_5,tdf_6,tdf_7,tdf_8,tdf_9,tdf_10,tdf_11,tdf_12,tdf_13,tdf_14,tdf_15) %>% 
    mutate(p.value = as.numeric(p.value),
           Dependance = case_when(p.value <= 0.05 ~ "Oui",
                                  TRUE ~ "Non"))
# Escape special characters
output <- output %>% mutate(data.name = str_replace_all(data.name, "_", "\\_"),  #escape special characters
                          data.name = gsub("bso_pop\\$", "", data.name))  #remove base name
rownames(output) <- NULL  #remove rownames
# Display table
kable(output, format = "html", caption = "Résultats des tests de dépendance entre les variables qualitatives") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = F, 
                html_font = "sans-serif")
Résultats des tests de dépendance entre les variables qualitatives
statistic.X-squared parameter.df p.value method data.name Dependance
1844.04364295786 63 0 Pearson’s Chi-squared test year and bso_classification Oui
10564.3561750721 21 0 Pearson’s Chi-squared test year and tier Oui
89.5375126042915 7 0 Pearson’s Chi-squared test year and oa_color_finale Oui
207.441377682062 7 0 Pearson’s Chi-squared test year and is_french_CA_bso_wos_openalex_single_lang Oui
155.842642000071 7 0 Pearson’s Chi-squared test year and is_covered_by_couperin Oui
7903.46367919093 27 0 Pearson’s Chi-squared test bso_classification and tier Oui
4498.17074461241 9 0 Pearson’s Chi-squared test bso_classification and oa_color_finale Oui
223.795592556606 9 0 Pearson’s Chi-squared test bso_classification and is_french_CA_bso_wos_openalex_single_lang Oui
13109.2394339774 9 0 Pearson’s Chi-squared test bso_classification and is_covered_by_couperin Oui
11883.1772351368 3 0 Pearson’s Chi-squared test tier and oa_color_finale Oui
228.437752033453 3 0 Pearson’s Chi-squared test tier and is_french_CA_bso_wos_openalex_single_lang Oui
15863.9058727782 3 0 Pearson’s Chi-squared test tier and is_covered_by_couperin Oui
1870.71182474937 1 0 Pearson’s Chi-squared test with Yates’ continuity correction oa_color_finale and is_french_CA_bso_wos_openalex_single_lang Oui
49839.9633692406 1 0 Pearson’s Chi-squared test with Yates’ continuity correction oa_color_finale and is_covered_by_couperin Oui
1145.55413773546 1 0 Pearson’s Chi-squared test with Yates’ continuity correction is_french_CA_bso_wos_openalex_single_lang and is_covered_by_couperin Oui
# Vérifier : 
    #- homoscédasticité (homogénéité de la variance au sein des sous-pops)
    #- indépendance (sous-pops doivent avoir un effet aléatoire sur Y)
    #- autocorrélation (résidus et erreurs ne doivent pas être auto-correlés)


Modélisations


Régressions linéaires : effets fixes


Modèle 1

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
Summary
# Transformation de la variable année
bso_pop <- bso_pop %>% 
    mutate(year = year - 2013)

# Modèle linéaire
lm1 <- lm(amount_apc_EUR_trusted ~ year, data = bso_pop)

# Summary du modèle
summary <- tidy(lm1)
kable(summary, format = "html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = F, 
                html_font = "sans-serif")
term estimate std.error statistic p.value
(Intercept) 1633.06034 5.168348 315.97340 0
year 39.79084 1.063911 37.40053 0

# Y estimé 
prediction_lm1 <- round(predict(lm1),2)
    # Mesures d'erreurs
mae_lm1 <- mean(abs(prediction_lm1 - bso_pop$amount_apc_EUR_trusted))
mdae_lm1 <- median(abs(prediction_lm1 - bso_pop$amount_apc_EUR_trusted))
sae_lm1 <- sum(abs(prediction_lm1 - bso_pop$amount_apc_EUR_trusted))

Chaque année, le montant des APC (fiables) augmente de 39.79 euros en moyenne.


Visualisation


Modèle 2

  • Y : amount_apc_EUR_trusted
  • effets fixes : year + tier + oa_color_finale + is_covered_by_couperin + is_french_CA_bso_wos_openalex_single_lang + bso_classification
# Modèle linéaire
lm2 <- lm(amount_apc_EUR_trusted ~ year + tier + oa_color_finale + is_covered_by_couperin + is_french_CA_bso_wos_openalex_single_lang + bso_classification, data = bso_pop)

# Summary du modèle
summary <- tidy(lm2)
kable(summary, format = "html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = F, 
                html_font = "sans-serif")
term estimate std.error statistic p.value
(Intercept) 1949.00037 11.2763319 172.839925 0.0e+00
year 36.92390 0.9814459 37.621943 0.0e+00
tier2 -271.02121 8.5219874 -31.802583 0.0e+00
tier3 -401.64620 8.4729780 -47.403191 0.0e+00
tier4 -349.58549 9.0370551 -38.683563 0.0e+00
oa_color_finalehybridOA 763.45374 7.1434349 106.874879 0.0e+00
is_covered_by_couperinnon covered -72.01286 8.4917775 -8.480304 0.0e+00
is_french_CA_bso_wos_openalex_single_langnon french CA 71.06395 4.2454420 16.738881 0.0e+00
bso_classificationChemistry -404.66028 12.7847102 -31.651893 0.0e+00
bso_classificationComputer and information sciences -478.30548 16.1292903 -29.654466 0.0e+00
bso_classificationEarth, Ecology, Energy and applied biology -287.59635 7.6822988 -37.436236 0.0e+00
bso_classificationEngineering -587.72553 21.7926464 -26.968984 0.0e+00
bso_classificationHumanities -881.87264 27.4958951 -32.072884 0.0e+00
bso_classificationMathematics -1010.07039 28.5262179 -35.408493 0.0e+00
bso_classificationMedical research -23.60761 4.8664055 -4.851138 1.2e-06
bso_classificationPhysical sciences, Astronomy -404.98640 8.6786828 -46.664500 0.0e+00
bso_classificationSocial sciences -470.37703 27.9288201 -16.841994 0.0e+00

# Analyse des résidus
lm2_residu <- round(residuals(lm2),2)

# Y estimé 
prediction_lm2 <- round(predict(lm2),2)
    # Mesures d'erreurs
mae_lm2 <- mean(abs(prediction_lm2 - bso_pop$amount_apc_EUR_trusted))
mdae_lm2 <- median(abs(prediction_lm2 - bso_pop$amount_apc_EUR_trusted))
sae_lm2 <- sum(abs(prediction_lm2 - bso_pop$amount_apc_EUR_trusted))


Modèles multi-niveaux : effets fixes + aléatoires


Modèle 1

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : bso_classification sur intercept

Chaque discipline dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression.

Summary
## Modèle multi-niveaux
mm1 <- lmer(amount_apc_EUR_trusted ~ year + (1 | bso_classification), data = bso_pop)

# Summary du modèle
print_summary <- function(model, name){
    
    summary <- tidy(model)
    print(kable(summary, format = "html") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), 
                    full_width = F, 
                    html_font = "sans-serif"))
    # Y estimé 
    prediction <- predict(model)
        # Mesures d'erreurs
    mae <- mean(abs(prediction - bso_pop$amount_apc_EUR_trusted))
    mdae <- median(abs(prediction - bso_pop$amount_apc_EUR_trusted))
    sae <- sum(abs(prediction - bso_pop$amount_apc_EUR_trusted))
        # Sauvegarde dans l'environnement
    assign(glue("summary_{name}"), summary, envir = .GlobalEnv)
    assign(glue("prediction_{name}"), prediction, envir = .GlobalEnv)
    assign(glue("mae_{name}"), mae, envir = .GlobalEnv)
    assign(glue("mdae_{name}"), mdae, envir = .GlobalEnv)
    assign(glue("sae_{name}"), sae, envir = .GlobalEnv)
    
}
print_summary(mm1, "mm1")
effect group term estimate std.error statistic
fixed NA (Intercept) 1305.66655 104.566292 12.48650
fixed NA year 42.21473 1.051654 40.14127
ran_pars bso_classification sd__(Intercept) 329.72755 NA NA
ran_pars Residual sd__Observation 805.72453 NA NA


Visualisation
# Sauvegarde les paramètres du modèle
intercept <- summary_mm1 %>% filter(term == "(Intercept)") %>% select(estimate)
    
# Visualisation des résultats
tidy(mm1, effects = "ran_vals") %>%  # effets aléatoires
    mutate(estimate = intercept$estimate + estimate,  # ajout de l'ordonnée à l'origine
            ymin = estimate - 2 * `std.error`, # intervalles à 95%
            ymax = estimate + 2 * `std.error`) %>% 
    arrange(estimate) %>% 
    mutate(level = as_factor(level)) %>% 
    ggplot(aes(x = level, y = estimate)) +
    ylab("Intercepts estimated") + 
    geom_pointrange(aes(ymin = ymin, ymax = ymax)) +
    geom_hline(yintercept = intercept$estimate, linetype = 2) +
    coord_flip()


Modèle 2

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : bso_classification sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque discipline dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par discipline.

Summary
## Modèle multi-niveaux
mm2 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | bso_classification), data = bso_pop)

# Summary du modèle
print_summary(mm2, "mm2")
effect group term estimate std.error statistic
fixed NA (Intercept) 1430.33224 101.49955 14.092006
fixed NA year 15.13775 10.39035 1.456905
ran_pars bso_classification sd__(Intercept) 318.00774 NA NA
ran_pars bso_classification cor__(Intercept).year -0.06309 NA NA
ran_pars bso_classification sd__year 31.66700 NA NA
ran_pars Residual sd__Observation 803.31022 NA NA


Visualisation
library(grid)
library(gtable)

viz_intercept_trend <- function(model, name, summary){
    
    # Sauvegarde les paramètres du modèle
    intercept <- summary %>% filter(term == "(Intercept)") %>% select(estimate)
    coef_trend <- summary %>% filter(term == "year") %>% select(estimate)
        # modèle entier
    tidy_model_intercept <- tidy(model, effects = "ran_vals") %>% arrange(estimate) %>% filter(term %in% "(Intercept)") %>% mutate(level = as_factor(level))
    
    # Visualisation des résultats
    
        ## Ordonnées à l'origine
    graph_intercept <- tidy_model_intercept %>% 
        mutate(estimate = intercept$estimate + estimate,  # ajout de l'ordonnée à l'origine
               ymin = estimate - 2 * `std.error`, # intervalles à 95%
               ymax = estimate + 2 * `std.error`) %>% 
        ggplot(aes(x = level, y = estimate)) +
        ylab("Intercepts estimated") + 
        geom_pointrange(aes(ymin = ymin, ymax = ymax)) +
        geom_hline(yintercept = intercept$estimate, linetype = 2) +
        coord_flip()
    
        ## On récupère l'ordre des niveaux pour l'injecter au deuxième graph 
    order_level <- tidy_model_intercept %>% select(level) %>% t()
    
        ## Coefficients des pentes
    graph_trend <- tidy(model, effects = "ran_vals") %>%  # effets aléatoires
        filter(term %in% "year") %>% # pentes
        mutate(level = factor(level, order = TRUE, levels = order_level)) %>% 
        mutate(estimate = coef_trend$estimate + estimate,  # ajout de l'ordonnée à l'origine
               ymin = estimate - 2 * `std.error`, # intervalles à 95%
               ymax = estimate + 2 * `std.error`) %>% 
        ggplot(aes(x = level, y = estimate)) +
        ylab("Slope coefficients estimated") + 
        geom_pointrange(aes(ymin = ymin, ymax = ymax)) +
        geom_hline(yintercept = coef_trend$estimate, linetype = 2) +
        coord_flip() +
        theme(axis.text.y = element_blank(),
              axis.title.y = element_blank())
    
    grid.newpage()
    grid.draw(cbind(ggplotGrob(graph_intercept), ggplotGrob(graph_trend), size="last"))

}
viz_intercept_trend(mm2, "mm2", summary_mm2)


Modèle 3

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : tier sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque tier dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par tier.

Summary
## Modèle multi-niveaux
mm3 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | tier), data = bso_pop)

# Summary du modèle
print_summary(mm3, "mm3")
effect group term estimate std.error statistic
fixed NA (Intercept) 1821.7423612 324.40976 5.6155597
fixed NA year 23.2517337 25.46493 0.9130884
ran_pars tier sd__(Intercept) 648.7117385 NA NA
ran_pars tier cor__(Intercept).year -0.9085942 NA NA
ran_pars tier sd__year 50.8713498 NA NA
ran_pars Residual sd__Observation 792.1446766 NA NA


Visualisation
viz_intercept_trend(mm3, "mm3", summary_mm3)


Modèle 4

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : is_covered_by_couperin sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque groupe d’articles selon qu’ils soient couvert par les accords Couperin dispose de son ordonnée à l’origine, et de son coefficient de pente de régression impacté par l’année.

Summary
## Modèle multi-niveaux
mm4 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | is_covered_by_couperin), data = bso_pop)

# Summary du modèle
print_summary(mm4, "mm4")
effect group term estimate std.error statistic
fixed NA (Intercept) 1887.64180 344.18858 5.484324
fixed NA year 30.58142 10.40395 2.939406
ran_pars is_covered_by_couperin sd__(Intercept) 486.68880 NA NA
ran_pars is_covered_by_couperin cor__(Intercept).year -1.00000 NA NA
ran_pars is_covered_by_couperin sd__year 14.64059 NA NA
ran_pars Residual sd__Observation 792.57366 NA NA


Visualisation
viz_intercept_trend(mm4, "mm4", summary_mm4)


Modèle 5

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : oa_color_finale sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque couleur OA dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par couleur OA.

Summary
## Modèle multi-niveaux
mm5 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | oa_color_finale), data = bso_pop)

# Summary du modèle
print_summary(mm5, "mm5")
effect group term estimate std.error statistic
fixed NA (Intercept) 1952.36971 438.95248 4.447793
fixed NA year 25.95908 17.97977 1.443794
ran_pars oa_color_finale sd__(Intercept) 620.72870 NA NA
ran_pars oa_color_finale cor__(Intercept).year -1.00000 NA NA
ran_pars oa_color_finale sd__year 25.38926 NA NA
ran_pars Residual sd__Observation 753.27830 NA NA


Visualisation
viz_intercept_trend(mm5, "mm5", summary_mm5)


Modèle 6

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : is_french_CA_bso_wos_openalex_single_lang sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque groupe d’articles selon que l’auteur correspondant soit français dispose de son ordonnée à l’origine, et de son coefficient de pente de régression impacté par l’année.

Summary
## Modèle multi-niveaux
mm6 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | is_french_CA_bso_wos_openalex_single_lang), data = bso_pop)

# Summary du modèle
print_summary(mm6, "mm6")
effect group term estimate std.error statistic
fixed NA (Intercept) 1648.06453 164.325970 10.029240
fixed NA year 38.07521 5.932018 6.418592
ran_pars is_french_CA_bso_wos_openalex_single_lang sd__(Intercept) 232.27695 NA NA
ran_pars is_french_CA_bso_wos_openalex_single_lang cor__(Intercept).year -1.00000 NA NA
ran_pars is_french_CA_bso_wos_openalex_single_lang sd__year 8.25383 NA NA
ran_pars Residual sd__Observation 816.34946 NA NA


Visualisation
viz_intercept_trend(mm6, "mm6", summary_mm6)


Comparaison des modèles multi-niveaux


Visualisation des APC estimés par modèle

APC cumulés
# On rassemble les prévisions
fitted <- bso_pop %>% select(year, amount_apc_EUR_trusted) %>% mutate(year = year + 2013)
fitted <- cbind(fitted, prediction_lm1, prediction_lm2, prediction_mm1, prediction_mm2, prediction_mm3, prediction_mm4, prediction_mm5, prediction_mm6)

# Visualisation
graph <- fitted %>% group_by(year) %>% summarise_all(sum) %>% 
    ggplot(aes(x = year)) +
    geom_line(aes(y = amount_apc_EUR_trusted, color = "observed values")) +
    geom_point(aes(y = amount_apc_EUR_trusted), col = "red", size = 1.2) +
    geom_line(aes(y = prediction_lm1, color = "lm1")) +
    geom_line(aes(y = prediction_lm2, color = "lm2")) +
    geom_line(aes(y = prediction_mm1, color = "mm1")) +
    geom_line(aes(y = prediction_mm2, color = "mm2")) +
    geom_line(aes(y = prediction_mm3, color = "mm3")) +
    geom_line(aes(y = prediction_mm4, color = "mm4")) +
    geom_line(aes(y = prediction_mm5, color = "mm5")) +
    geom_line(aes(y = prediction_mm6, color = "mm6")) +
    scale_color_manual(values = c('observed values' = "red",
                                  "lm1" = "#999999",
                                  "lm2" = "#E69F00",
                                  "mm1" = "#56B4E9",
                                  "mm2" = "#009E73",
                                  "mm3" = "#000000",
                                  "mm4" = "#0072B2",
                                  "mm5" = "#D55E00",
                                  "mm6" = "#CC79A7"
                                  )) +
    labs(y = "Total cost of APCs", title = "Somme des APC estimés par les modèles, par année", colour = " ") +
    scale_y_continuous(labels = scales::comma) +
    scale_x_continuous(breaks = seq(2013, 2020, 1)) +
    theme_classic() +
    theme(legend.position = "right", axis.title.x = element_blank()) +
    ggrepel::geom_text_repel(aes(y = amount_apc_EUR_trusted, label = paste("  ", format(round(amount_apc_EUR_trusted / 1e6, 1), trim = TRUE), "M€")),
                             color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5)
graph
saving_plot("11.models_APC-paid_sum", graph)


APC moyens
# Visualisation
graph <- fitted %>% group_by(year) %>% summarise_all(.funs = c(mean="mean")) %>% mutate(year = as.numeric(as.character(year))) %>% 
    ggplot(aes(x = year)) +
    geom_line(aes(y = amount_apc_EUR_trusted_mean, color = "observed values")) +
    geom_point(aes(y = amount_apc_EUR_trusted_mean), col = "red", size = 1.2) +
    geom_line(aes(y = prediction_lm1_mean, color = "lm1")) +
    geom_line(aes(y = prediction_lm2_mean, color = "lm2")) +
    geom_line(aes(y = prediction_mm1_mean, color = "mm1")) +
    geom_line(aes(y = prediction_mm2_mean, color = "mm2")) +
    geom_line(aes(y = prediction_mm3_mean, color = "mm3")) +
    geom_line(aes(y = prediction_mm4_mean, color = "mm4")) +
    geom_line(aes(y = prediction_mm5_mean, color = "mm5")) +
    geom_line(aes(y = prediction_mm6_mean, color = "mm6")) +
    scale_color_manual(values = c('observed values' = "red",
                                  "lm1" = "#999999",
                                  "lm2" = "#E69F00",
                                  "mm1" = "#56B4E9",
                                  "mm2" = "#009E73",
                                  "mm3" = "#000000",
                                  "mm4" = "#0072B2",
                                  "mm5" = "#D55E00",
                                  "mm6" = "#CC79A7"
                                  )) +
    labs(color = "", y = "Average APC", title = "Moyenne des APC estimés par les modèles, par année", colour = "") +
    scale_y_continuous(labels = scales::comma) +
    scale_x_continuous(breaks=seq(2013, 2020, 1)) +
    theme_classic() +
    theme(legend.position = "right", axis.title.x = element_blank()) +
    ggrepel::geom_text_repel(aes(y = amount_apc_EUR_trusted_mean, label = paste(" ", round(amount_apc_EUR_trusted_mean, 0))), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5)
graph
saving_plot("12.models_APC-paid_mean", graph)

Les graphiques représentant les APC cumulés et moyens montrent tous deux que la meilleure estimation (la courbe s’approchant le plus des valeurs observées en rouge) est le troisième modèle multi-niveaux, “mm3”. Il s’agit du modèle où la source de variation est le tier.


Tableau récaptiulatif

# Tableau récapitulatif
recapitulatif <- data.frame(
    fix.ef = c("year",
               "year + tier + oa_color_finale + is_covered_by_couperin + is_french_CA_bso_wos_openalex_single_lang + bso_classification", 
               "year", 
               "year", 
               "year", 
               "year", 
               "year", 
               "year"),
    ran.ef = c("", 
               "", 
               "bso_classification", 
               "bso_classification", 
               "tier", 
               "is_covered_by_couperin", 
               "oa_color_finale", 
               "is_french_CA_bso_wos_openalex_single_lang"),
    AIC = c(AIC(lm1), AIC(lm2), AIC(mm1), AIC(mm2), AIC(mm3), AIC(mm4), AIC(mm5), AIC(mm6)),
    BIC = c(BIC(lm1), BIC(lm2), BIC(mm1), BIC(mm2), BIC(mm3), BIC(mm4), BIC(mm5), BIC(mm6)),
    MAE = c(mae_lm1, mae_lm2, mae_mm1, mae_mm2, mae_mm3, mae_mm4, mae_mm5, mae_mm6),
    MDAE = c(mdae_lm1, mdae_lm2, mdae_mm1, mdae_mm2, mdae_mm3, mdae_mm4, mdae_mm5, mdae_mm6),
    SAE = c(sae_lm1, sae_lm2, sae_mm1, sae_mm2, sae_mm3, sae_mm4, sae_mm5, sae_mm6))


# Table finale
recapitulatif <- recapitulatif %>% mutate_if(is.numeric, round, digits = 0)
recapitulatif %>% DT::datatable(options = list(searching = F), width = '60%')


Hétérogénéité de la variance

Tests d’hétérogénéité des variances

La variance peut différer d’un groupe à un autre et il convient de le vérifier statistiquement pour le prendre en compte si besoin. Le test de Bartlett ci-dessous teste l’homogénéité de la variance comme hypothèse nulle, nous l’appliquons à la variable de l’année ‘year’ et regardons son résultat :

# Test d'homogénéité de la variance (H0)
output <- bartlett.test(amount_apc_EUR_trusted ~ interaction(year), data = bso_pop)  #hétérogène

# Présentation des résultats
df <- data.frame(value = unlist(output))
tdf <- as.data.frame(t(df))
tdf <- tdf %>% mutate(data.name = str_replace_all(data.name, "_", "\\_"),  #escape special characters
                          data.name = gsub("bso_pop\\$", "", data.name))
formattable(tdf) 
statistic.Bartlett’s K-squared parameter.df p.value data.name method
value 1305.81315717348 7 9.19370755033575e-278 amount_apc_EUR_trusted by interaction(year) Bartlett test of homogeneity of variances

La p-value étant très proche de zéro nous rejetons l’hypothèse nulle : la variance est hétérogène. Dans les modèles suivants nous prenons en compte cette hétérogénéité de la variance et les comparons aux modélisations précédentes à l’aide d’une analyse de variance ANOVA.


Modélisations prenant en compte l’hétérogènéité

Nous construisons un modèle en ajoutant des poids (weights) pouvant varier d’une année à l’autre, pour un effet aléatoire par tier d’éditeurs (modèle le plus performant estimé précédemment).

## Modèle multi-niveaux
library(nlme)
hmm1 <- lme(amount_apc_EUR_trusted ~ year, 
            data = bso_pop, 
            random = ~ 1 + year | tier, 
            weights = varIdent(form = ~ 1 | year), 
            control =list(msMaxIter = 1000, msMaxEval = 1000))

# Summary du modèle
print_summary(hmm1, "hmm1")
effect group term estimate std.error df statistic p.value
fixed NA (Intercept) 1813.747058 234.56788 119611 7.732291 0.0000000
fixed NA year 24.371379 19.38816 119611 1.257024 0.2087474
ran_pars tier sd_(Intercept) 469.015194 NA NA NA NA
ran_pars tier cor_year.(Intercept) -0.856127 NA NA NA NA
ran_pars tier sd_year 38.710926 NA NA NA NA
ran_pars Residual sd_Observation 624.928686 NA NA NA NA
var_model varIdent(1 | year) 0 1.000000 NA NA NA NA
var_model varIdent(1 | year) 1 1.080814 NA NA NA NA
var_model varIdent(1 | year) 2 1.292266 NA NA NA NA
var_model varIdent(1 | year) 3 1.263160 NA NA NA NA
var_model varIdent(1 | year) 4 1.333781 NA NA NA NA
var_model varIdent(1 | year) 5 1.376769 NA NA NA NA
var_model varIdent(1 | year) 6 1.324006 NA NA NA NA
var_model varIdent(1 | year) 7 1.240774 NA NA NA NA


Comparaison des modèles : ANOVA

# ANOVA mm3 et hmm1
anova <- anova(mm3, hmm1)

# Présentation des résultats
df <- data.frame(value = unlist(anova))
tdf <- as.data.frame(t(df))
formattable(tdf) 
npar Sum Sq Mean Sq F value
value 1 523160.2 523160.2 0.8337304

#L'analyse de variance sous hypothèse nulle d'égalité de moyennes, montre qu'il n'y a pas de différence significative entre les deux modèles (p-value = ``r round(tdf$`Pr(>F)`, 2)``). Nous conservons donc le modèle initial présenté en partie précédente. 



II - Modélisations for french CA articles


# Préparation du jeu de données prêt à modéliser
bso_frenchCA <- bso_mvp %>% 
    filter(!is.na(bso_classification),
           !is.na(oa_color_finale),
           !is.na(is_french_CA_bso_wos_openalex_single_lang),
           !is.na(amount_apc_EUR),
           !is.na(is_covered_by_couperin)) %>% 
    filter(is_french_CA_bso_wos_openalex_single_lang == "french CA", apc_has_been_paid == 1, amount_apc_EUR > 0)

Par rapport à la partie précédente, ici ne sont gardés que les articles ayant un auteur correspondant français. Il y a ainsi 81054 articles avec un CA français, pour lesquels on sait qu’un APC a été payé, et dont la discipline, le tier, la couleur et la couverture par Couperin sont connus (seuls 4 articles avec un CA français et des APC payés ont été mis de côté parce que l’une des variables citées étaient inconnues).


Analyse exploratoire


amount_apc_EUR


Discipline

#detach(package:plyr)
table2_y1_class <- bso_frenchCA %>% group_by(year, bso_classification) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup()

  # Plot
graph <- ggplot(table2_y1_class, aes(x = year, y = montant_moyen, group = bso_classification, colour = bso_classification)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par discipline et par année", colour = "") +
  scale_color_manual(values = c("#025bbb", "#9d6939", "#3aaf29", "#fb6d35", "#fcb2b8", "#f1e401", "#284803", "#852895", "#24c0e8", "#919598"),
                     breaks = c("Biology (fond.)", "Medical research", "Chemistry", "Computer and \n information sciences", "Social sciences", "Earth, Ecology, \nEnergy and applied biology", "Physical sciences, Astronomy", "Engineering", "Humanities", "Mathematics")) +
  theme_classic() +
  theme(legend.position = "right", 
        axis.title.x = element_blank())
graph
saving_plot("13.explo_french-CA_discipline", graph)


Tier

table2_y1_tier <- bso_frenchCA %>% group_by(year, tier) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% 
    mutate(tier = as.character(paste("tier", tier, sep = " ")))

  # Plot
graph <- ggplot(table2_y1_tier, aes(x = year, y = montant_moyen, group = tier, colour = tier)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par tier et par année", color = "") +
  theme_classic() +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099", "#0072B2", "#D55E00")) +
  theme(legend.position = "right", 
        axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y1_tier %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("14.explo_french-CA_tier", graph)


Couleur OA

table2_y1_color <- bso_frenchCA %>% group_by(year, oa_color_finale) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% mutate(oa_color_finale = as.character(oa_color_finale))

  # Plot
graph <- ggplot(table2_y1_color, aes(x = year, y = montant_moyen, group = oa_color_finale, colour = oa_color_finale)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par couleur OA et par année") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#0072B2", "#FFCC00"),
                     breaks = c("hybridOA", "gold")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y1_color %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("15.explo_french-CA_OAcolor", graph)


Covered by Couperin

table2_y1_coup <- bso_frenchCA %>% group_by(year, is_covered_by_couperin) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR)) %>% ungroup() %>% 
    mutate(is_covered_by_couperin = case_when(is_covered_by_couperin == "covered" ~ "covered by couperin",
                                              is_covered_by_couperin == "non covered" ~ "not covered by couperin"))

  # Plot
graph <- ggplot(table2_y1_coup, aes(x = year, y = montant_moyen, group = is_covered_by_couperin, colour = is_covered_by_couperin)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si l'article est couvert par les accords Couperin") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("covered by couperin", "not covered by couperin")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y1_coup %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("16.explo_french-CA_covered", graph)


Y2 = amount_apc_EUR_trusted


Discipline

table2_y2_class <- bso_frenchCA %>%  group_by(year, bso_classification) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup()

  # Plot
graph <- ggplot(table2_y2_class, aes(x = year, y = montant_moyen, group = bso_classification, colour = bso_classification)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par discipline et par année", colour = "") +
  scale_color_manual(values = c("#025bbb", "#9d6939", "#3aaf29", "#f1e401", "#284803", "#fcb2b8", "#fb6d35", "#852895", "#24c0e8", "#919598"),
                     breaks = c("Biology (fond.)", "Medical research", "Chemistry", "Earth, Ecology, \nEnergy and applied biology", "Physical sciences, Astronomy", "Social sciences", "Computer and \n information sciences", "Engineering", "Humanities", "Mathematics")) +
  theme_classic() +
  theme(legend.position = "right", 
        axis.title.x = element_blank())
graph
saving_plot("17.explo_french-CA_discipline_trusted", graph)


Tier

table2_y2_tier <- bso_frenchCA %>% group_by(year, tier) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% 
    mutate(tier = as.character(paste("tier", tier, sep = " ")))

  # Plot
graph <- ggplot(table2_y2_tier, aes(x = year, y = montant_moyen, group = tier, colour = tier)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(y = "Average APC", title = "Montants moyens des APC payés par tier et par année", color = "") +
  theme_classic() +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099", "#0072B2", "#D55E00")) +
  theme(legend.position = "right", 
        axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y2_tier %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("18.explo_french-CA_tier_trusted", graph)


Couleur OA

table2_y2_color <- bso_frenchCA %>% group_by(year, oa_color_finale) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% mutate(oa_color_finale = as.character(oa_color_finale))

  # Plot
graph <- ggplot(table2_y2_color, aes(x = year, y = montant_moyen, group = oa_color_finale, colour = oa_color_finale)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par couleur OA et par année") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#0072B2", "#FFCC00"),
                     breaks = c("hybridOA", "gold")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y2_color %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("19.explo_french-CA_OAcolor_trusted", graph)


Covered by Couperin

table2_y2_coup <- bso_frenchCA %>% group_by(year, is_covered_by_couperin) %>% 
  summarise(montant_moyen = mean(amount_apc_EUR_trusted, na.rm = T)) %>% ungroup() %>% 
    mutate(is_covered_by_couperin = case_when(is_covered_by_couperin == "covered" ~ "covered by couperin",
                                              is_covered_by_couperin == "non covered" ~ "not covered by couperin"))

  # Plot
graph <- ggplot(table2_y2_coup, aes(x = year, y = montant_moyen, group = is_covered_by_couperin, colour = is_covered_by_couperin)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour="white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Average APC", title = "Montants moyens des APC payés par année, selon si l'article est couvert par les accords Couperin") +
  theme_classic() +
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  scale_color_manual(values = c("#009E73", "#660099"),
                     breaks = c("covered by couperin", "not covered by couperin")) +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table2_y2_coup %>% filter(year %% 2 == 0),
                           aes(label = round(montant_moyen, 0)), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5) 
graph
saving_plot("20.explo_french-CA_covered_trusted", graph)


Ici encore en prenant comme population seulement les articles avec le CA français, nous décidons de modéliser le montant des APC fiable.

n_old <- nrow(bso_frenchCA)
bso_frenchCA <- bso_frenchCA %>% select(-amount_apc_EUR) %>% filter(!is.na(amount_apc_EUR_trusted))

Le nombre d’articles pour les modélisations est ainsi de 65785. En ne gardant que les sources d’APC fiables, 15269 articles ont été écartés, soit 18.84% des articles avec des APC payés, un CA français et des valeurs connues pour les autres variables.


Dépendances

# Test de corrélation (Spearman car distribution non normale)
output <- cor.test(bso_frenchCA$amount_apc_EUR_trusted, bso_frenchCA$year, method = "spearman") # corrélation significative
df <- data.frame(value = unlist(output))
tdf <- as.data.frame(t(df))
tdf <- tdf %>% mutate(data.name = str_replace_all(data.name, "_", "\\_"),  #escape special characters
                          data.name = gsub("bso_frenchCA\\$", "", data.name))
formattable(tdf) 
statistic.S p.value estimate.rho null.value.rho alternative method data.name
value 39956541196575.7 0 0.15791001897175 0 two.sided Spearman’s rank correlation rho amount_apc_EUR_trusted and year

# Tests de dépendance
    # Année
save_results(chisq.test(bso_frenchCA$year, bso_frenchCA$bso_classification), "1")
save_results(chisq.test(bso_frenchCA$year, bso_frenchCA$tier), "2")
save_results(chisq.test(bso_frenchCA$year, bso_frenchCA$oa_color_finale), "3")
save_results(chisq.test(bso_frenchCA$year, bso_frenchCA$is_covered_by_couperin), "4")
    # Discipline
save_results(chisq.test(bso_frenchCA$bso_classification, bso_frenchCA$tier), "5")
save_results(chisq.test(bso_frenchCA$bso_classification, bso_frenchCA$oa_color_finale), "6")
save_results(chisq.test(bso_frenchCA$bso_classification, bso_frenchCA$is_covered_by_couperin), "7")
    # Tier
save_results(chisq.test(bso_frenchCA$tier, bso_frenchCA$oa_color_finale), "8")
save_results(chisq.test(bso_frenchCA$tier, bso_frenchCA$is_covered_by_couperin), "9")
    # Couleur OA
save_results(chisq.test(bso_frenchCA$oa_color_finale, bso_frenchCA$is_covered_by_couperin), "10")
# Présentation des résultats dans une table
output <- rbind(tdf_1,tdf_2,tdf_3,tdf_4,tdf_5,tdf_6,tdf_7,tdf_8,tdf_9,tdf_10) %>% 
    mutate(p.value = as.numeric(p.value),
           Dependance = case_when(p.value <= 0.05 ~ "Oui",
                                  TRUE ~ "Non"))
# Escape special characters
output <- output %>% mutate(data.name = str_replace_all(data.name, "_", "\\_"),  #escape special characters
                          data.name = gsub("bso_frenchCA\\$", "", data.name))  #remove base name
rownames(output) <- NULL  #remove rownames
# Display table
kable(output, format = "html", caption = "Résultats des tests de dépendance entre les variables qualitatives") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = F, 
                html_font = "sans-serif")
Résultats des tests de dépendance entre les variables qualitatives
statistic.X-squared parameter.df p.value method data.name Dependance
1179.36582494436 63 0 Pearson’s Chi-squared test year and bso_classification Oui
6674.06953803312 21 0 Pearson’s Chi-squared test year and tier Oui
49.3016149041887 7 0 Pearson’s Chi-squared test year and oa_color_finale Oui
74.7630393039464 7 0 Pearson’s Chi-squared test year and is_covered_by_couperin Oui
4064.54084270166 27 0 Pearson’s Chi-squared test bso_classification and tier Oui
2327.25470251979 9 0 Pearson’s Chi-squared test bso_classification and oa_color_finale Oui
7596.56394938853 9 0 Pearson’s Chi-squared test bso_classification and is_covered_by_couperin Oui
5757.9702867729 3 0 Pearson’s Chi-squared test tier and oa_color_finale Oui
7564.31853293548 3 0 Pearson’s Chi-squared test tier and is_covered_by_couperin Oui
21632.3827056521 1 0 Pearson’s Chi-squared test with Yates’ continuity correction oa_color_finale and is_covered_by_couperin Oui

Ici encore il existe une forte dépendance entre les variables explicatives du modèle, à ne pas inclure dans une même modélisation donc.


Distribution

# Normalité de Y
mean <- mean(bso_frenchCA$amount_apc_EUR_trusted, na.rm = TRUE)   
sd <- sd(bso_frenchCA$amount_apc_EUR_trusted, na.rm = TRUE)  
ggplot(bso_frenchCA, aes(bso_frenchCA$amount_apc_EUR_trusted)) +
      geom_histogram(aes(y=..density..),  color="black", fill = "steelblue", alpha = 0.2) +
    geom_density( color="red", size = 1, adjust = 5) +
    stat_function(fun = dnorm, colour = "Black", size = 1, args = list(mean = mean, sd = sd))  +
    xlim(c(min(bso_frenchCA$amount_apc_EUR_trusted)-1, max(bso_frenchCA$amount_apc_EUR_trusted)+1))+ ggtitle("Distribution du montant des APC fiables") +
  xlab("Montant des APC fiables") + ylab("Densité") +
     theme_classic()

# Test statistique de normalité
out <- JarqueBeraTest(bso_frenchCA$amount_apc_EUR_trusted, robust = FALSE, method = "chisq")  # Y non normal (on rejette H0)
library(formattable)
df <- data.frame(value = unlist(out))
tdf <- as.data.frame(t(df))
formattable(tdf)    
statistic.X-squared parameter.df p.value method data.name
value 36052.1176412677 2 0 Jarque Bera Test bso_frenchCA$amount_apc_EUR_trusted

Le montant des APC fiables n’est pas normalement distribué pour les articles ayant un CA français et où des APC ont été payés.

Modèles multi-niveaux

Modèle 1

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : bso_classification sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque discipline dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par discipline.

Summary

## Modèle multi-niveaux
bso_frenchCA <- bso_frenchCA %>% mutate(year = year - 2013)
mm2.1 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | bso_classification), data = bso_frenchCA)

# Summary du modèle
print_summary(mm2.1, "mm2.1")
effect group term estimate std.error statistic
fixed NA (Intercept) 1372.2979161 90.66522 15.135880
fixed NA year 14.0251387 11.13582 1.259461
ran_pars bso_classification sd__(Intercept) 281.5986411 NA NA
ran_pars bso_classification cor__(Intercept).year 0.0652288 NA NA
ran_pars bso_classification sd__year 33.4758250 NA NA
ran_pars Residual sd__Observation 764.0597813 NA NA


Visualisation

viz_intercept_trend(mm2.1, "mm2.1", summary_mm2.1)


Modèle 2

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : tier sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque tier dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par tier.

Summary

## Modèle multi-niveaux
mm2.2 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | tier), data = bso_frenchCA)

# Summary du modèle
print_summary(mm2.2, "mm2.2")
effect group term estimate std.error statistic
fixed NA (Intercept) 1732.9556270 212.67692 8.148301
fixed NA year 26.8037745 20.59414 1.301524
ran_pars tier sd__(Intercept) 425.0766712 NA NA
ran_pars tier cor__(Intercept).year -0.9076197 NA NA
ran_pars tier sd__year 41.0629758 NA NA
ran_pars Residual sd__Observation 763.5601474 NA NA


Visualisation

viz_intercept_trend(mm2.2, "mm2.2", summary_mm2.2)


Modèle 3

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : is_covered_by_couperin sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque groupe d’articles selon qu’ils soient couvert par les accords Couperin dispose de son ordonnée à l’origine, et de son coefficient de pente de régression impacté par l’année.

Summary

## Modèle multi-niveaux
mm2.3 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | is_covered_by_couperin), data = bso_frenchCA)

# Summary du modèle
print_summary(mm2.3, "mm2.3")
effect group term estimate std.error statistic
fixed NA (Intercept) 1812.33453 360.19889 5.031483
fixed NA year 30.25574 14.24999 2.123210
ran_pars is_covered_by_couperin sd__(Intercept) 509.27854 NA NA
ran_pars is_covered_by_couperin cor__(Intercept).year -1.00000 NA NA
ran_pars is_covered_by_couperin sd__year 20.06310 NA NA
ran_pars Residual sd__Observation 764.77417 NA NA


Visualisation

viz_intercept_trend(mm2.3, "mm2.3", summary_mm2.3)


Modèle 4

  • Y : amount_apc_EUR_trusted
  • effets fixes : year
  • effets aléatoires : oa_color_finale sur intercept (ordonnée à l’origine) et trend (pente de la droite de régression)

Chaque couleur OA dispose de son ordonnée à l’origine, et l’année impacte l’angle de la pente de régression par couleur OA.

Summary

## Modèle multi-niveaux
mm2.4 <- lmer(amount_apc_EUR_trusted ~ year + (1 + year | oa_color_finale), data = bso_frenchCA)

# Summary du modèle
print_summary(mm2.4, "mm2.4")
effect group term estimate std.error statistic
fixed NA (Intercept) 1929.00036 398.59206 4.8395353
fixed NA year 17.50577 24.02843 0.7285438
ran_pars oa_color_finale sd__(Intercept) 563.59894 NA NA
ran_pars oa_color_finale cor__(Intercept).year -1.00000 NA NA
ran_pars oa_color_finale sd__year 33.93122 NA NA
ran_pars Residual sd__Observation 737.07446 NA NA


Visualisation

viz_intercept_trend(mm2.4, "mm2.4", summary_mm2.4)


Modèle 5

  • Y : amount_apc_EUR_trusted
  • effets fixes : year

On fait un simple modèle linéaire.

Summary

## Modèle multi-niveaux
lm2.1 <- lm(amount_apc_EUR_trusted ~ year, data = bso_frenchCA)

# Summary du modèle
print_summary(lm2.1, "lm2.1")
term estimate std.error statistic p.value
(Intercept) 1561.70792 6.506397 240.02654 0
year 41.10933 1.357211 30.28958 0


Comparaison des modèles multi-niveaux

APC cumulés

# On rassemble les prévisions
fitted2 <- bso_frenchCA %>% select(year, amount_apc_EUR_trusted) %>% mutate(year = year + 2013)
fitted2 <- cbind(fitted2, prediction_mm2.1, prediction_mm2.2, prediction_mm2.3, prediction_mm2.4)

# Visualisation
graph <- fitted2 %>% group_by(year) %>% summarise_all(sum) %>% 
    ggplot(aes(x = year)) +
    geom_line(aes(y = amount_apc_EUR_trusted, color = "observed values")) +
    geom_point(aes(y = amount_apc_EUR_trusted), col = "red", size = 1.2) +
    geom_line(aes(y = prediction_mm2.1, color = "discipline")) +
    geom_line(aes(y = prediction_mm2.2, color = "tier")) +
    geom_line(aes(y = prediction_mm2.3, color = "covered by Couperin")) +
    geom_line(aes(y = prediction_mm2.4, color = "OA color")) +
    scale_color_manual(values = c('observed values' = "red",
                                  "discipline" = "#009E73",
                                  "tier" = "#000000",
                                  "covered by Couperin" = "#0072B2",
                                  "OA color" = "#FFCC00"
                                  )) +
    labs(color = "", y = "Total cost of APCs", title = "Somme des APC estimés par les modèles, par année
(Articles avec un auteur correspondant français)", colour = "Modélisations") +
    scale_y_continuous(labels = scales::comma) +
    scale_x_continuous(breaks=seq(2013, 2020, 1)) +
    theme_classic() +
    theme(legend.position = "right", axis.title.x = element_blank()) +
    ggrepel::geom_text_repel(aes(y = amount_apc_EUR_trusted, label = paste("  ", format(round(amount_apc_EUR_trusted / 1e6, 1), trim = TRUE), "M€")),
                             color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5)
graph
saving_plot("21.models_frenchCA_sum", graph)


APC moyens

# Visualisation
graph <- fitted2 %>% group_by(year) %>% summarise_all(.funs = c(mean="mean")) %>% mutate(year = as.numeric(as.character(year))) %>% 
    ggplot(aes(x = year)) +
    geom_line(aes(y = amount_apc_EUR_trusted_mean, color = "observed values")) +
    geom_point(aes(y = amount_apc_EUR_trusted_mean), col = "red", size = 1.2) +
    geom_line(aes(y = prediction_mm2.1_mean, color = "discipline")) +
    geom_line(aes(y = prediction_mm2.2_mean, color = "tier")) +
    geom_line(aes(y = prediction_mm2.3_mean, color = "covered by Couperin")) +
    geom_line(aes(y = prediction_mm2.4_mean, color = "OA color")) +
    scale_color_manual(values = c('observed values' = "red",
                                  "discipline" = "#009E73",
                                  "tier" = "#000000",
                                  "covered by Couperin" = "#0072B2",
                                  "OA color" = "#FFCC00"
                                  )) +
    labs(color = "", y = "Average APC", title = "Moyenne des APC estimés par les modèles, par année
(Articles avec un auteur correspondant français)", colour = "Modélisations") +
    scale_y_continuous(labels = scales::comma) +
    scale_x_continuous(breaks=seq(2013, 2020, 1)) +
    theme_classic() +
    theme(legend.position = "right", axis.title.x = element_blank()) +
    ggrepel::geom_text_repel(aes(y = amount_apc_EUR_trusted_mean, label = paste(" ", round(amount_apc_EUR_trusted_mean, 0))), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5)
graph
saving_plot("22.models_frenchCA_mean", graph)

Visuellement parlant, la meilleure estimation semble être celle réalisée à partir du deuxième modèle où la source de variation est le tier.


Erreurs de prévisions par année

error_year <- fitted2 %>% group_by(year) %>% summarise_all(sum) %>% 
    mutate(error_discipline = round(abs(prediction_mm2.1 - amount_apc_EUR_trusted), 0),
           error_tier = round(abs(prediction_mm2.2 - amount_apc_EUR_trusted), 0),
           error_covered_couperin = round(abs(prediction_mm2.3 - amount_apc_EUR_trusted), 0),
           error_OAcolor = round(abs(prediction_mm2.4 - amount_apc_EUR_trusted), 0)) %>% 
    select(1, 7:10) %>% mutate(year = as.character(year)) %>% 
    bind_rows(summarise_all(., ~if(is.numeric(.)) sum(.) else "Somme des erreurs")) %>% 
    bind_rows(summarise_all(., ~if(is.numeric(.)) mean(.) else "Moyenne des erreurs")) %>% 
    mutate_if(is.numeric, round, digits = 0)
error_year %>% DT::datatable(options = list(searching = F), width = '60%')

Ici nous avons calculé le montant total d’APC payés par année selon le BSO (valeurs réelles) et selon les 4 modèles estimés. Nous avons ensuite soustrait les montants réels aux montants prédits par les modèles (en gardant une valeur absolue), de manière à obtenir une erreur de prévision par année. Enfin, nous avons calculé la somme et la moyenne de ces erreurs, ce qui nous permet de constater que le meilleur modèle est celui où le tier est la source de variation / l’effet aléatoire. C’est donc à partir de ce modèle que nous estimons le montant total d’APC pour les années 2021-2030, tendances inchangées.


Tableau récaptiulatif

# Tableau récapitulatif
recapitulatif2 <- data.frame(
    fix.ef = c("year",
               "year", 
               "year", 
               "year"),
    ran.ef = c("bso_classification", 
               "tier", 
               "is_covered_by_couperin", 
               "oa_color_finale"),
    AIC = c(AIC(mm2.1), AIC(mm2.2), AIC(mm2.3), AIC(mm2.4)),
    BIC = c(BIC(mm2.1), BIC(mm2.2), BIC(mm2.3), BIC(mm2.4)),
    MAE = c(mae_mm2.1, mae_mm2.2, mae_mm2.3, mae_mm2.4),
    MDAE = c(mdae_mm2.1, mdae_mm2.2, mdae_mm2.3, mdae_mm2.4),
    SAE = c(sae_mm2.1, sae_mm2.2, sae_mm2.3, sae_mm2.4))


# Table finale
    # on arrondit les mesures de qualité
recapitulatif2 <- recapitulatif2 %>% mutate_if(is.numeric, round, digits = 0)
    # on affiche le tableau
recapitulatif2 %>% DT::datatable(options = list(searching = F), width = '60%')

On trouve dans cette table récapitulative différentes mesures de la qualité des modèles. Les critères AIC, BIC ainsi que les mesures d’erreurs MAE (mean absolute error), MDAE (median absolute error) et SAE (sum absolute error) doivent être minimisés. Le classement des modèles au vu de ces statistiques n’est pas le même d’une mesure à une autre, ainsi :

  • AIC, BIC, MDAE : selon ces mesures le modèle qui minimise l’erreur est celui avec la couleur OA ;
  • MAE, SAE : selon ces mesures le meilleur modèle est celui avec la discipline.


Tables pivot (drag and drop)


Data des premiers modèles : articles où un APC a été payé

#devtools::install_github(c("ramnathv/htmlwidgets", "smartinsightsfromdata/rpivotTable"))
library(rpivotTable)
bso_pop <- bso_pop %>% mutate(year = year + 2013)
rpivotTable(bso_pop, rows = "year", cols = c("tier"), width = "100%", height = "400px")


Data des deuxièmes modèles : articles où un APC a été payé ET le CA français

bso_frenchCA <- bso_frenchCA %>% mutate(year = year + 2013)
rpivotTable(bso_frenchCA, rows = "year", cols = c("tier"), width = "100%", height = "400px")



III - Imputation des valeurs manquantes de Y : 2013-2020

Missing CA nationality

La variable “is_french_CA” qui informe si oui ou non l’auteur correspondant de l’article est français, contient 17.587 articles pour lesquels des APC ont été payés mais où la nationalité de l’auteur correspondant est inconnu. Nous imputons ces valeurs manquantes en suivant les étapes suivantes :

  • nous récupérons les proportions d’articles avec et sans french CA par tier et par année, pour les articles pour lesquels des APC ont été payés et où la nationalité du CA est connue ;
  • nous imputons aux valeurs inconnues les valeurs ‘0’ ou ‘1’ indiquant si le CA est français, en respectant les proportions des valeurs connues par tier et par année.

Avant cette manipulation, 81058 articles avaient un CA français, l’effectif s’élève désormais à 90820 articles entre 2013 et 2020.


Missing or non trusted APC amount

La variable “amount_apc_EUR_trusted” qui indique le montant des APC fiables payés en euros, contient 15.273 articles pour lesquels des APC ont été payés, l’auteur correspondant est français (avant imputation des valeurs manquantes) et le montant des APC est inconnu ou non fiable. Nous imputons ces valeurs manquantes en suivant les étapes suivantes :

  • nous récupérons les moyennes des APC payés par tier et par année, pour les articles pour lesquels des APC ont été payés, le CA est français et le montant des APC est connu et fiable ;
  • nous imputons aux valeurs inconnues les moyennes selon le tier et l’année de l’article.

Avant cette manipulation, le montant total des APC pour la période 2013-2020 était de 1.1418841^{8} euros (2.5173181^{7} en 2020). Il s’élève désormais à 1.4402254^{8} euros, 3.0132149^{7} en 2020.


Visualisation

  • Total cost of APCs in the wild per year
table <- bso_frenchCA %>% select(year, amount_apc_EUR_trusted) %>% group_by(year) %>% summarise_all(sum)

  # Plot
graph <- ggplot(table, aes(x = year, y = amount_apc_EUR_trusted)) +
  geom_line(size=1, alpha=0.9, linetype=1, col = "red") +
  geom_point(colour = "white", size = .6, pch = 21, stroke = 1.5) +
  labs(y = "Total cost of APCs", title = "Montant total des APC payés par année") +
  scale_y_continuous(labels = scales::comma, limits = c(2000000, 31000000)) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  theme_classic() +
  theme(legend.position = "none", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(aes(label = paste("  ", format(round(amount_apc_EUR_trusted / 1e6, 1), trim = TRUE), "M€")), color = '#333333', size = 3, hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5)
graph
saving_plot("23.original_APC_cost", graph)

  • Données reconstruites
table <- completed_apc %>% 
    group_by(year, oa_color_finale) %>% 
    filter(is_french_CA_bso_wos_openalex_single_lang == 1 & apc_has_been_paid == 1) %>% 
    summarise(somme = sum(complete_amount_APC, na.rm = T)) %>% 
    ungroup() %>% 
    mutate(oa_color_finale = as.character(oa_color_finale)) %>% 
    group_by(year) %>% 
    bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.character), ~"overall")))

  # Plot
graph <- ggplot(table, aes(x = year, y = somme, group = oa_color_finale, colour = oa_color_finale)) +
  geom_line(size=1, alpha=0.9, linetype=1) +
  geom_point(colour = "white", size = .6, pch = 21, stroke = 1.5) +
  labs(color = "", y = "Total cost of APCs", title = "Montant total des APC payés par couleur OA et par année, 
après reconstruction de la variable") +
  scale_color_manual(breaks = c("overall", "gold", "hybridOA"), 
                     values = c("red", "#FFCC00", "#0072B2")) +
  scale_y_continuous(labels = scales::comma, limits = c(2000000, 31000000)) +
  scale_x_continuous(breaks=seq(2013, 2020, 1)) +
  theme_classic() +
  theme(legend.position = "right", axis.title.x = element_blank()) +
  ggrepel::geom_text_repel(data = table %>% filter(year %% 2 == 0),
                           aes(label = paste("  ", format(round(somme / 1e6, 1), trim = TRUE), "M€")), color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, 
                           min.segment.length = 0.2, size = 3)
graph
saving_plot("24.reconstructed_APC_cost", graph)



IV - Prévisions Y : 2021-2030


Max and min increase per year

Data from models (part II)

# Table de min et max par catégorie de variables quali (intégrées aux modèles)

    # Fonction
table_growth <- function(variable, name){
    
    table <- bso_frenchCA %>% group_by(year, {{variable}}) %>% summarise(nb_articles = n()) %>% ungroup() %>% 
        group_by({{variable}}) %>% mutate(diff = nb_articles - lag(nb_articles),
                                          growth = round((nb_articles - lag(nb_articles)) / lag(nb_articles) * 100, 2)) %>% na.omit() %>% 
        arrange(growth) %>% filter(row_number()==1 | row_number()==n()) %>% 
        mutate(mesure = case_when(row_number() == 1 ~ "min",
                                  row_number() == 2 ~ "max")) %>% 
        pivot_wider(names_from = mesure, values_from = c(nb_articles, year, diff, growth)) %>% 
        rename(variable = 1) %>% mutate(variable = as.character(variable))
    
    # Sauvegarde de la table
    assign(glue("growth_{name}"), table, envir = .GlobalEnv)

}

    # Couleur OA
table_growth(oa_color_finale, "color")
    # Tier
table_growth(tier, "tier")
    # Discipline 
table_growth(bso_classification, "discipline")
    # French CA (total car données filtrées uniquement sur is_french_CA == 1)
growth_year <- bso_frenchCA %>% group_by(year) %>% summarise(nb_articles = n()) %>% ungroup() %>% 
    mutate(diff = nb_articles - lag(nb_articles),
           growth = round((nb_articles - lag(nb_articles)) / lag(nb_articles) * 100, 2)) %>% na.omit() %>% 
    arrange(growth) %>% filter(row_number()==1 | row_number()==n()) %>% 
    mutate(mesure = case_when(row_number() == 1 ~ "min",
                              row_number() == 2 ~ "max"),
           variable = "whole data") %>% 
    pivot_wider(names_from = mesure, values_from = c(nb_articles, year, diff, growth)) %>% 
    mutate(variable = as.character(variable))

# Combinaison de ces valeurs dans une seule table
growth_all <- rbind(growth_color, growth_tier, growth_discipline, growth_year) %>% 
    mutate(Variable = case_when(variable == "1" ~ "tier 1",
                                variable == "2" ~ "tier 2",
                                variable == "3" ~ "tier 3",
                                variable == "4" ~ "tier 4",
                                TRUE ~ variable)) %>% 
    rename(`Évolution minimum nombre d'articles` = diff_min,
           `Évolution maximum nombre d'articles` = diff_max,
           `Growth rate minimum` = growth_min,
           `Growth rate maximum` = growth_max,
           `Année du G.R minimum` = year_min,
           `Année du G.R maximum` = year_max) %>% ungroup() %>% 
    select(Variable, `Évolution minimum nombre d'articles`, `Growth rate minimum`, `Année du G.R minimum`, `Évolution maximum nombre d'articles`, `Growth rate maximum`, `Année du G.R maximum`)

growth_all %>% DT::datatable(options = list(pageLength = 17, searching = F))


Whole dataset

# Table de min et max par catégorie de variables quali (intégrées aux modèles)

    # Function
table_growth2 <- function(variable, name){
    
    table <- data %>% group_by(year, {{variable}}) %>% summarise(nb_articles = n()) %>% ungroup() %>% 
        group_by({{variable}}) %>% mutate(diff = nb_articles - lag(nb_articles),
                                          growth = round((nb_articles - lag(nb_articles)) / lag(nb_articles) * 100, 2)) %>% na.omit() %>% 
        arrange(growth) %>% filter(row_number()==1 | row_number()==n()) %>% 
        mutate(mesure = case_when(row_number() == 1 ~ "min",
                                  row_number() == 2 ~ "max")) %>% 
        pivot_wider(names_from = mesure, values_from = c(nb_articles, year, diff, growth)) %>% 
        rename(variable = 1) %>% mutate(variable = as.character(variable))
    
    # Sauvegarde de la table
    assign(glue("growth2_{name}"), table, envir = .GlobalEnv)

}

    # Couleur OA
table_growth2(oa_color_finale, "color")
    # Tier
table_growth2(tier, "tier")
growth2_tier <- growth2_tier %>% arrange(variable)
    # French CA
table_growth2(is_french_CA_bso_wos_openalex_single_lang, "frenchCA")
growth2_frenchCA <- growth2_frenchCA %>% mutate(variable = case_when(variable == "1" ~ "french CA",
                                                                     variable == "0" ~ "non french CA"))
    # Discipline 
table_growth2(bso_classification, "discipline")
    # French CA (total car données filtrées uniquement sur is_french_CA == 1)
growth2_year <- data %>% group_by(year) %>% summarise(nb_articles = n()) %>% ungroup() %>% 
    mutate(diff = nb_articles - lag(nb_articles),
           growth = round((nb_articles - lag(nb_articles)) / lag(nb_articles) * 100, 2)) %>% na.omit() %>% 
    arrange(growth) %>% filter(row_number()==1 | row_number()==n()) %>% 
    mutate(mesure = case_when(row_number() == 1 ~ "min",
                              row_number() == 2 ~ "max"),
           variable = "whole data") %>% 
    pivot_wider(names_from = mesure, values_from = c(nb_articles, year, diff, growth)) %>% 
    mutate(variable = as.character(variable))

# Combinaison de ces valeurs dans une seule table
growth2_all <- rbind(growth2_color, growth2_tier, growth2_frenchCA, growth2_discipline, growth2_year) %>% 
    mutate(Variable = case_when(variable == "1" ~ "tier 1",
                                variable == "2" ~ "tier 2",
                                variable == "3" ~ "tier 3",
                                variable == "4" ~ "tier 4",
                                TRUE ~ variable)) %>% 
    rename(`Évolution minimum nombre d'articles` = diff_min,
           `Évolution maximum nombre d'articles` = diff_max,
           `Growth rate minimum` = growth_min,
           `Growth rate maximum` = growth_max,
           `Année du G.R minimum` = year_min,
           `Année du G.R maximum` = year_max) %>% ungroup() %>% 
    select(Variable, `Évolution minimum nombre d'articles`, `Growth rate minimum`, `Année du G.R minimum`, `Évolution maximum nombre d'articles`, `Growth rate maximum`, `Année du G.R maximum`)

growth2_all %>% DT::datatable(options = list(pageLength = 22, searching = F))


Scénario 1 : observed increase continue


Dans ce premier scénario aucune hypothèse n’est émise : les tendances restent inchangées. Les prix des APC et le nombre d’articles évoluent de manière naturelle.

Pour calculer les montants totaux des APC pour les années 2021 à 2030, les étapes sont les suivantes :

  • utilisation du meilleur modèle présenté ci-dessus (par tier) pour estimer l’évolution des APC moyens (en euros) d’une année à l’autre ;
  • estimation d’un nouveau modèle multi-niveaux basé sur les tiers pour estimer l’évolution du nombre d’articles d’une année à l’autre ;
  • calcul des APC moyens pour 2021-2030 (par tier) : on récupère les dernières valeurs connues des APC moyens (càd 2020) par tier, auxquelles on somme les coefficients d’évolution estimés par tier;
  • calcul du nombre d’articles pour 2021-2030 (par tier) : on récupère le nombre d’articles en 2020 par tier, auquel on somme les coefficients d’évolution estimés par tier ;
  • calcul du montant total pour 2021-2030 (par tier) : celui-ci est calculé en multipliant les APC moyens prédits par le nombre d’articles prédits, par tier ;
  • la valeur finale / unique par année de 2021 à 2030 est alors la somme des montants totaux calculés par tier.


Résultats globaux (modèle final)

Moyenne APC

plot_global_predictions <- function(predictions, predictions_min, predictions_max, nom_y_label, nom_title, ymin, ymax, name){
    
    graph <- ggplot(predictions_all, aes(x=year)) + 
        geom_line(aes(y = {{predictions}}), col = "red") +
        geom_point(aes(y = {{predictions}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min}}, ymax = {{predictions_max}}), alpha = 0.1) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année")) +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        ggrepel::geom_text_repel(data = predictions_all %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions}}+0.01*{{predictions}}, label = ifelse({{predictions}} < 1000000, 
                                                                         round({{predictions}}, 0), 
                                                                         paste(" ", format(round({{predictions}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph
   
}
plot_global_predictions(mean_APC_total, mean_APC_total_min, mean_APC_total_max, "Average APC", "de la moyenne des APC", 615, 2800, "25.prev_S1_mean")


Nombre d’articles

plot_global_predictions(nb_articles_total, nb_articles_total_min, nb_articles_total_max, "Number of articles", "du nombre d'articles", 0, 30000, "26.prev_S1_articles")


Montant total

plot_global_predictions(montant_apc_total, montant_apc_total_min, montant_apc_total_max, "Total cost of APCs", "du montant total d'APC payés", 0, 72000000, "27.prev_S1_total")


Tableau des prévisions

prev <- predictions_all %>% select(1, 38:46)
prev %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Résultats par tier (modèle final)

Moyenne APC

plot_tier_predictions <- function(data,
                                  predictions_1, predictions_2, predictions_3, predictions_4, predictions_total, 
                                  predictions_min_1, predictions_min_2, predictions_min_3, predictions_min_4, predictions_min_total, 
                                  predictions_max_1, predictions_max_2, predictions_max_3, predictions_max_4, predictions_max_total, 
                                  nom_y_label, nom_title, 
                                  ymin, ymax,
                                  name){
    
    graph <- data %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_1}}, color = "tier 1")) +
        geom_line(aes(y = {{predictions_2}}, color = "tier 2")) +
        geom_line(aes(y = {{predictions_3}}, color = "tier 3")) +
        geom_line(aes(y = {{predictions_4}}, color = "tier 4")) +
        geom_ribbon(aes(ymin = {{predictions_min_1}}, ymax = {{predictions_max_1}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_2}}, ymax = {{predictions_max_2}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_3}}, ymax = {{predictions_max_3}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_4}}, ymax = {{predictions_max_4}}), alpha = 0.1) +
        geom_line(aes(y = {{predictions_total}}, color = "overall"), size = .58) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "tier 1" = "#009E73",
                                      "tier 2" = "#660099",
                                      "tier 3" = "#0072B2",
                                      "tier 4" = "#D55E00")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = data %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste(" ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2030),
                                 aes(y = {{predictions_1}}, label = ifelse({{predictions_1}} < 1000000, 
                                                                         round({{predictions_1}}, 0), 
                                                                         paste(" ", format(round({{predictions_1}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2030),
                                 aes(y = {{predictions_2}}, label = ifelse({{predictions_2}} < 1000000, 
                                                                         round({{predictions_2}}, 0), 
                                                                         paste(" ", format(round({{predictions_2}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2030),
                                 aes(y = {{predictions_3}}, label = ifelse({{predictions_3}} < 1000000, 
                                                                         round({{predictions_3}}, 0), 
                                                                         paste(" ", format(round({{predictions_3}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2013),
                                 aes(y = {{predictions_1}}, label = " tier 1"), 
                                 color = '#009E73', hjust=-.85, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2013),
                                 aes(y = {{predictions_2}}, label = " tier 2"), 
                                 color = '#660099', hjust=-.85, nudge_y = 2, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = data %>% filter(year == 2013),
                                 aes(y = {{predictions_3}}, label = " tier 3"), 
                                 color = '#0072B2', hjust=-.85, nudge_y = -3, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_tier_predictions(predictions_all,
                      mean_APC_1, mean_APC_2, mean_APC_3, mean_APC_4, mean_APC_total, 
                      mean_APC_min_1, mean_APC_min_2, mean_APC_min_3, mean_APC_min_4, mean_APC_total_min, 
                      mean_APC_max_1, mean_APC_max_2, mean_APC_max_3, mean_APC_max_4, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800,
                      "28.prev_S1_tier_mean")


Nombre d’articles

plot_tier_predictions(predictions_all,
                      nb_articles_1, nb_articles_2, nb_articles_3, nb_articles_4, nb_articles_total, 
                      nb_articles_min_1, nb_articles_min_2, nb_articles_min_3, nb_articles_min_4, nb_articles_total_min, 
                      nb_articles_max_1, nb_articles_max_2, nb_articles_max_3, nb_articles_max_4, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000,
                      "29.prev_S1_tier_articles")


Montant total

plot_tier_predictions(predictions_all,
                      montant_apc_1, montant_apc_2, montant_apc_3, montant_apc_4, montant_apc_total, 
                      montant_apc_min_1, montant_apc_min_2, montant_apc_min_3, montant_apc_min_4, montant_apc_total_min, 
                      montant_apc_max_1, montant_apc_max_2, montant_apc_max_3, montant_apc_max_4, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000,
                      "30.prev_S1_tier_total")


Tableau des prévisions

prev <- predictions_all %>% select(1:37)
prev %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Résultats par couleur OA (décomposition)

Moyenne APC

plot_color_predictions <- function(predictions_gold, predictions_hybridOA, predictions_total, predictions_final, 
                                  predictions_min_gold, predictions_min_hybridOA, predictions_min_total, predictions_min_final, 
                                  predictions_max_gold, predictions_max_hybridOA, predictions_max_total, predictions_max_final, 
                                  nom_y_label, nom_title,
                                  ymin, ymax,
                                  name){
    
    graph <- predictions_all_color %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall"), size = .58) +
        geom_line(aes(y = {{predictions_final}}, color = "best model"), size = .58) +
        geom_point(aes(y = {{predictions_final}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_final}}, ymax = {{predictions_max_final}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "black",
                                      "best model" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_final}}, label = ifelse({{predictions_final}} < 1000000, 
                                                                         round({{predictions_final}}, 0), 
                                                                         paste("  ", format(round({{predictions_final}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions(mean_APC_gold, mean_APC_hybridOA, mean_APC_total, mean_APC_total_final,
                      mean_APC_min_gold, mean_APC_min_hybridOA, mean_APC_total_min, mean_APC_total_min_final,
                      mean_APC_max_gold, mean_APC_max_hybridOA, mean_APC_total_max, mean_APC_total_max_final,
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "31.prev_S1_color_mean")


Nombre d’articles

plot_color_predictions_text <- function(predictions_gold, predictions_hybridOA, predictions_total, predictions_final, 
                                  predictions_min_gold, predictions_min_hybridOA, predictions_min_total, predictions_min_final, 
                                  predictions_max_gold, predictions_max_hybridOA, predictions_max_total, predictions_max_final, 
                                  nom_y_label, nom_title,
                                  ymin, ymax, name){
    
    graph <- predictions_all_color %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall"), size = .58) +
        geom_line(aes(y = {{predictions_final}}, color = "best model"), size = .58) +
        geom_point(aes(y = {{predictions_final}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_final}}, ymax = {{predictions_max_final}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "black",
                                      "best model" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_final}}, label = ifelse({{predictions_final}} < 1000000, 
                                                                         round({{predictions_final}}, 0), 
                                                                         paste("  ", format(round({{predictions_final}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_text(nb_articles_gold, nb_articles_hybridOA, nb_articles_total, nb_articles_total_final,
                      nb_articles_min_gold, nb_articles_min_hybridOA, nb_articles_total_min, nb_articles_total_min_final,
                      nb_articles_max_gold, nb_articles_max_hybridOA, nb_articles_total_max, nb_articles_total_max_final,
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "32.prev_S1_color_articles")


Montant total

plot_color_predictions_text(montant_apc_gold, montant_apc_hybridOA, montant_apc_total, montant_apc_total_final,
                      montant_apc_min_gold, montant_apc_min_hybridOA, montant_apc_total_min, montant_apc_total_min_final,
                      montant_apc_max_gold, montant_apc_max_hybridOA, montant_apc_total_max, montant_apc_total_max_final,
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "33.prev_S1_color_total")


Tableau des prévisions

prev <- predictions_all_color %>% select(1:(ncol(predictions_all_color)-9))
prev %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Résultats selon la couverture par les accords Couperin (décomposition)

Moyenne APC

plot_couperin_predictions <- function(predictions_covered, predictions_non_covered, predictions_total, predictions_final, 
                                    predictions_min_covered, predictions_min_non_covered, predictions_min_total, predictions_min_final, 
                                    predictions_max_covered, predictions_max_non_covered, predictions_max_total, predictions_max_final, 
                                    nom_y_label, nom_title,
                                    ymin, ymax, name){
    
    graph <- predictions_all_couperin %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_covered}}, color = "covered by couperin")) +
        geom_line(aes(y = {{predictions_non_covered}}, color = "non covered by couperin")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall"), size = .58) +
        geom_line(aes(y = {{predictions_final}}, color = "best model"), size = .58) +
        geom_point(aes(y = {{predictions_final}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_covered}}, ymax = {{predictions_max_covered}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_non_covered}}, ymax = {{predictions_max_non_covered}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_final}}, ymax = {{predictions_max_final}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "black",
                                      "best model" = "red",
                                      "non covered by couperin" = "#660099",
                                      "covered by couperin" = "#009E73")) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et 
selon la couverture par les accords Couperin"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        ggrepel::geom_text_repel(data = predictions_all_couperin %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_final}}, label = ifelse({{predictions_final}} < 1000000, 
                                                                         round({{predictions_final}}, 0), 
                                                                         paste("  ", format(round({{predictions_final}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_couperin %>% filter(year == 2030),
                                 aes(y = {{predictions_covered}}, label = ifelse({{predictions_covered}} < 1000000, 
                                                                         round({{predictions_covered}}, 0), 
                                                                         paste(" ", format(round({{predictions_covered}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_couperin_predictions(mean_APC_covered, mean_APC_non_covered, mean_APC_total, mean_APC_total_final,
                      mean_APC_min_covered, mean_APC_min_non_covered, mean_APC_total_min, mean_APC_total_min_final,
                      mean_APC_max_covered, mean_APC_max_non_covered, mean_APC_total_max, mean_APC_total_max_final,
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "34.prev_S1_couperin_mean")


Nombre d’articles

plot_couperin_predictions(nb_articles_covered, nb_articles_non_covered, nb_articles_total, nb_articles_total_final,
                      nb_articles_min_covered, nb_articles_min_non_covered, nb_articles_total_min, nb_articles_total_min_final,
                      nb_articles_max_covered, nb_articles_max_non_covered, nb_articles_total_max, nb_articles_total_max_final,
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "35.prev_S1_couperin_articles")


Montant total

plot_couperin_predictions(montant_apc_covered, montant_apc_non_covered, montant_apc_total, montant_apc_total_final,
                      montant_apc_min_covered, montant_apc_min_non_covered, montant_apc_total_min, montant_apc_total_min_final,
                      montant_apc_max_covered, montant_apc_max_non_covered, montant_apc_total_max, montant_apc_total_max_final,
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "36.prev_S1_couperin_total")


Tableau des prévisions

prev <- predictions_all_couperin %>% select(1:(ncol(predictions_all_couperin)-9))
prev %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Résultats par discipline (décomposition)

Moyenne APC

plot_discipline_predictions <- function(predictions_biology, predictions_chemistry, predictions_computer, predictions_earth, predictions_engineering, predictions_humanities, predictions_mathematics, predictions_medical, predictions_physical, predictions_social, predictions_total, predictions_final, 
                                        predictions_min_biology, predictions_min_chemistry, predictions_min_computer, predictions_min_earth, predictions_min_engineering, predictions_min_humanities, predictions_min_mathematics, predictions_min_medical, predictions_min_physical, predictions_min_social, predictions_min_total, predictions_min_final, 
                                        predictions_max_biology, predictions_max_chemistry, predictions_max_computer, predictions_max_earth, predictions_max_engineering, predictions_max_humanities, predictions_max_mathematics, predictions_max_medical, predictions_max_physical, predictions_max_social, predictions_max_total, predictions_max_final, 
                                        nom_y_label, nom_title,
                                        ymin, ymax, name){
    
    graph <- predictions_all_discipline %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_biology}}, color = "Biology (fond.)")) +
        geom_line(aes(y = {{predictions_chemistry}}, color = "Chemistry")) +
        geom_line(aes(y = {{predictions_computer}}, color = "Computer and \n information sciences")) +
        geom_line(aes(y = {{predictions_earth}}, color = "Earth, Ecology, \nEnergy and applied biolog")) +
        geom_line(aes(y = {{predictions_engineering}}, color = "Engineering")) +
        geom_line(aes(y = {{predictions_humanities}}, color = "Humanities")) +
        geom_line(aes(y = {{predictions_mathematics}}, color = "Mathematics")) +
        geom_line(aes(y = {{predictions_medical}}, color = "Medical research")) +
        geom_line(aes(y = {{predictions_physical}}, color = "Physical sciences, Astronomy")) +
        geom_line(aes(y = {{predictions_social}}, color = "Social sciences")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall"), size = .58) +
        geom_line(aes(y = {{predictions_final}}, color = "best model"), size = .58) +
        geom_point(aes(y = {{predictions_final}}), col = "red", size = 1.2) +
       # geom_ribbon(aes(ymin = {{predictions_min_biology}}, ymax = {{predictions_max_biology}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_chemistry}}, ymax = {{predictions_max_chemistry}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_computer}}, ymax = {{predictions_max_computer}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_earth}}, ymax = {{predictions_max_earth}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_engineering}}, ymax = {{predictions_max_engineering}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_humanities}}, ymax = {{predictions_max_humanities}}), alpha = 0.1) +
       # geom_ribbon(aes(ymin = {{predictions_min_mathematics}}, ymax = {{predictions_max_mathematics}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_medical}}, ymax = {{predictions_max_medical}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_physical}}, ymax = {{predictions_max_physical}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_social}}, ymax = {{predictions_max_social}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_final}}, ymax = {{predictions_max_final}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "black",
                                      "best model" = "red",
                                      "Medical research" = "#9d6939",
                                      "Biology (fond.)" = "#025bbb",
                                      "Chemistry" = "#3aaf29",
                                      "Computer and \n information sciences" = "#fb6d35",
                                      "Earth, Ecology, \nEnergy and applied biolog" = "#f1e401",
                                      "Engineering" = "#852895",
                                      "Humanities" = "#24c0e8",
                                      "Mathematics" = "#919598",
                                      "Physical sciences, Astronomy" = "#284803",
                                      "Social sciences" = "#fcb2b8")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par discipline"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_discipline %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_final}}, label = ifelse({{predictions_final}} < 1000000, 
                                                                         round({{predictions_final}}, 0), 
                                                                         paste("  ", format(round({{predictions_final}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_discipline_predictions(mean_APC_biology, mean_APC_chemistry, mean_APC_computer, mean_APC_earth, mean_APC_engineering, mean_APC_humanities, mean_APC_mathematics, mean_APC_medical, mean_APC_physical, mean_APC_social, mean_APC_total, mean_APC_total_final, 
                          mean_APC_min_biology, mean_APC_min_chemistry, mean_APC_min_computer, mean_APC_min_earth, mean_APC_min_engineering, mean_APC_min_humanities, mean_APC_min_mathematics, mean_APC_min_medical, mean_APC_min_physical, mean_APC_min_social, mean_APC_total_min, mean_APC_total_min_final, 
                          mean_APC_max_biology, mean_APC_max_chemistry, mean_APC_max_computer, mean_APC_max_earth, mean_APC_max_engineering, mean_APC_max_humanities, mean_APC_max_mathematics, mean_APC_max_medical, mean_APC_max_physical, mean_APC_max_social, mean_APC_total_max, mean_APC_total_max_final,
                          "Average APC", "de la moyenne des APC",
                          615, 2800, "37.prev_S1_discipline_mean")


Nombre d’articles

plot_discipline_predictions(nb_articles_biology, nb_articles_chemistry, nb_articles_computer, nb_articles_earth, nb_articles_engineering, nb_articles_humanities, nb_articles_mathematics, nb_articles_medical, nb_articles_physical, nb_articles_social, nb_articles_total, nb_articles_total_final, 
                          nb_articles_min_biology, nb_articles_min_chemistry, nb_articles_min_computer, nb_articles_min_earth, nb_articles_min_engineering, nb_articles_min_humanities, nb_articles_min_mathematics, nb_articles_min_medical, nb_articles_min_physical, nb_articles_min_social, nb_articles_total_min, nb_articles_total_min_final, 
                          nb_articles_max_biology, nb_articles_max_chemistry, nb_articles_max_computer, nb_articles_max_earth, nb_articles_max_engineering, nb_articles_max_humanities, nb_articles_max_mathematics, nb_articles_max_medical, nb_articles_max_physical, nb_articles_max_social, nb_articles_total_max, nb_articles_total_max_final,
                          "Number of articles", "du nombre d'articles",
                          0, 30000, "38.prev_S1_discipline_articles")


Montant total

plot_discipline_predictions(montant_apc_biology, montant_apc_chemistry, montant_apc_computer, montant_apc_earth, montant_apc_engineering, montant_apc_humanities, montant_apc_mathematics, montant_apc_medical, montant_apc_physical, montant_apc_social, montant_apc_total, montant_apc_total_final, 
                          montant_apc_min_biology, montant_apc_min_chemistry, montant_apc_min_computer, montant_apc_min_earth, montant_apc_min_engineering, montant_apc_min_humanities, montant_apc_min_mathematics, montant_apc_min_medical, montant_apc_min_physical, montant_apc_min_social, montant_apc_total_min, montant_apc_total_min_final, 
                          montant_apc_max_biology, montant_apc_max_chemistry, montant_apc_max_computer, montant_apc_max_earth, montant_apc_max_engineering, montant_apc_max_humanities, montant_apc_max_mathematics, montant_apc_max_medical, montant_apc_max_physical, montant_apc_max_social, montant_apc_total_max, montant_apc_total_max_final,
                          "Total cost of APCs", "du montant total d'APC payés",
                          0, 72000000, "39.prev_S1_discipline_total")


Tableau des prévisions

prev <- predictions_all_discipline %>% select(1:(ncol(predictions_all_discipline)-9))
prev %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Résultats totaux par variable considérée

Moyenne APC

plot_all_predictions <- function(predictions_tier, predictions_color, predictions_couperin, predictions_discipline, 
                                 predictions_min_tier, predictions_min_color, predictions_min_couperin, predictions_min_discipline, 
                                 predictions_max_tier, predictions_max_color, predictions_max_couperin, predictions_max_discipline, 
                                 nom_y_label, nom_title,
                                 ymin, ymax, name){
    
    graph <- predictions_all_whole_models %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_color}}, color = "OA color")) +
        geom_line(aes(y = {{predictions_couperin}}, color = "couperin")) +
        geom_line(aes(y = {{predictions_discipline}}, color = "discipline")) +
        geom_point(aes(y = {{predictions_tier}}), col = "red", size = 1.2) +
        geom_line(aes(y = {{predictions_tier}}, color = "tier"), size = .58) +
        geom_ribbon(aes(ymin = {{predictions_min_tier}}, ymax = {{predictions_max_tier}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_color}}, ymax = {{predictions_max_color}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_couperin}}, ymax = {{predictions_max_couperin}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_discipline}}, ymax = {{predictions_max_discipline}}), alpha = 0.1) +
        scale_color_manual(breaks = c("couperin", "OA color", "discipline", "tier"),
                           values = c("tier" = "red",
                                      "OA color" = "#FFCC00",
                                      "couperin" = "#009E73",
                                      "discipline" = "#0066CC")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par modèle"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_whole_models %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_tier}}, label = ifelse({{predictions_tier}} < 1000000, 
                                                                         round({{predictions_tier}}, 0), 
                                                                         paste("  ", format(round({{predictions_tier}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_all_predictions(mean_APC_total_tier, mean_APC_total_color, mean_APC_total_couperin, mean_APC_total_discipline,
                     mean_APC_total_min_tier, mean_APC_total_min_color, mean_APC_total_min_couperin, mean_APC_total_min_discipline,
                     mean_APC_total_max_tier, mean_APC_total_max_color, mean_APC_total_max_couperin, mean_APC_total_max_discipline,
                     "Average APC", "de la moyenne des APC", 
                     615, 2800, "40.prev_S1_resume_mean")


Nombre d’articles

plot_all_predictions(nb_articles_total_tier, nb_articles_total_color, nb_articles_total_couperin, nb_articles_total_discipline,
                     nb_articles_total_min_tier, nb_articles_total_min_color, nb_articles_total_min_couperin, nb_articles_total_min_discipline,
                     nb_articles_total_max_tier, nb_articles_total_max_color, nb_articles_total_max_couperin, nb_articles_total_max_discipline,
                     "Number of articles", "du nombre d'articles",
                     0, 30000, "41.prev_S1_resume_articles")


Montant total

plot_all_predictions(montant_apc_total_tier, montant_apc_total_color, montant_apc_total_couperin, montant_apc_total_discipline,
                     montant_apc_total_min_tier, montant_apc_total_min_color, montant_apc_total_min_couperin, montant_apc_total_min_discipline,
                     montant_apc_total_max_tier, montant_apc_total_max_color, montant_apc_total_max_couperin, montant_apc_total_max_discipline,
                     "Total cost of APCs", "du montant total d'APC payés",
                     0, 72000000, "42.prev_S1_resume_total")


Tableau des prévisions

predictions_all_whole_models %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Scénario 2.A : emballement


Un deuxième scénario vise à quantifier le montant des APC en cas “d’emballement” des articles gold : que ce soit en moyenne d’APC ou en nombre d’articles.

Dans un premier temps, cette évolution est quantifiée de la manière suivante :

  • moyenne des APC

    • l’évolution des APC moyens pour les articles gold est multipliée par 2, soit une hausse de 100% (passant de +47.6 à +95.2 euros par an) ;
    • l’évolution des APC moyens pour les articles hybrid reste constante (soit -12.6 euros par an) ;
  • nombre d’articles

    • l’évolution du nombre d’articles gold est multipliée par 1.25, soit une hausse de 25% (passant de +967 à +1209 articles par an) ;
    • l’évolution du nombre d’articles hybrid reste constante (soit 2014 articles, comme en 2020).


Dans un second temps, les prévisions sont calculées en faisant varier l’évolution du nombre d’articles gold ; non plus une hausse de 25%, mais une hausse de 20% puis de 30%.

Pour ces 3 prévisions, la formule pour calculer le montant total des APC est la même que le scénario précédent, à savoir :

montant total = APC moyen x nombre d’articles


Visualisation des prévisions - hausse de 25%

Moyenne APC

plot_color_predictions_2A_1.25 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2A_1.25 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.25 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.25 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2A_1.25(mean_APC_gold_2A_1.25, mean_APC_hybridOA_2A_1.25, mean_APC_total, 
                      mean_APC_min_gold_2A_1.25, mean_APC_min_hybridOA_2A_1.25, mean_APC_total_min, 
                      mean_APC_max_gold_2A_1.25, mean_APC_max_hybridOA_2A_1.25, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "43.prev_S2A_1.25_mean")


Nombre d’articles

plot_color_predictions_2A_1.25(nb_articles_gold_2A_1.25, nb_articles_hybridOA_2A_1.25, nb_articles_total, 
                      nb_articles_min_gold_2A_1.25, nb_articles_min_hybridOA_2A_1.25, nb_articles_total_min, 
                      nb_articles_max_gold_2A_1.25, nb_articles_max_hybridOA_2A_1.25, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "44.prev_S2A_1.25_articles")


Montant total

plot_color_predictions_2A_1.25(montant_apc_gold_2A_1.25, montant_apc_hybridOA_2A_1.25, montant_apc_total, 
                      montant_apc_min_gold_2A_1.25, montant_apc_min_hybridOA_2A_1.25, montant_apc_total_min, 
                      montant_apc_max_gold_2A_1.25, montant_apc_max_hybridOA_2A_1.25, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "45.prev_S2A_1.25_total")


Tableau des prévisions

predictions_all_color_2A_1.25 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Visualisation des prévisions - hausse de 20%

Moyenne APC

plot_color_predictions_2A_1.2 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2A_1.2 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.2 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.2 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2A_1.2(mean_APC_gold_2A_1.2, mean_APC_hybridOA_2A_1.2, mean_APC_total, 
                      mean_APC_min_gold_2A_1.2, mean_APC_min_hybridOA_2A_1.2, mean_APC_total_min, 
                      mean_APC_max_gold_2A_1.2, mean_APC_max_hybridOA_2A_1.2, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "46.prev_S2A_1.2_mean")


Nombre d’articles

plot_color_predictions_2A_1.2(nb_articles_gold_2A_1.2, nb_articles_hybridOA_2A_1.2, nb_articles_total, 
                      nb_articles_min_gold_2A_1.2, nb_articles_min_hybridOA_2A_1.2, nb_articles_total_min, 
                      nb_articles_max_gold_2A_1.2, nb_articles_max_hybridOA_2A_1.2, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "47.prev_S2A_1.2_articles")


Montant total

plot_color_predictions_2A_1.2(montant_apc_gold_2A_1.2, montant_apc_hybridOA_2A_1.2, montant_apc_total, 
                      montant_apc_min_gold_2A_1.2, montant_apc_min_hybridOA_2A_1.2, montant_apc_total_min, 
                      montant_apc_max_gold_2A_1.2, montant_apc_max_hybridOA_2A_1.2, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "48.prev_S2A_1.2_total")


Tableau des prévisions

predictions_all_color_2A_1.2 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Visualisation des prévisions - hausse de 30%

Moyenne APC

plot_color_predictions_2A_1.3 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2A_1.3 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.3 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2A_1.3 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2A_1.3(mean_APC_gold_2A_1.3, mean_APC_hybridOA_2A_1.3, mean_APC_total, 
                      mean_APC_min_gold_2A_1.3, mean_APC_min_hybridOA_2A_1.3, mean_APC_total_min, 
                      mean_APC_max_gold_2A_1.3, mean_APC_max_hybridOA_2A_1.3, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "49.prev_S2A_1.3_mean")


Nombre d’articles

plot_color_predictions_2A_1.3(nb_articles_gold_2A_1.3, nb_articles_hybridOA_2A_1.3, nb_articles_total, 
                      nb_articles_min_gold_2A_1.3, nb_articles_min_hybridOA_2A_1.3, nb_articles_total_min, 
                      nb_articles_max_gold_2A_1.3, nb_articles_max_hybridOA_2A_1.3, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "50.prev_S2A_1.3_articles")


Montant total

plot_color_predictions_2A_1.3(montant_apc_gold_2A_1.3, montant_apc_hybridOA_2A_1.3, montant_apc_total, 
                      montant_apc_min_gold_2A_1.3, montant_apc_min_hybridOA_2A_1.3, montant_apc_total_min, 
                      montant_apc_max_gold_2A_1.3, montant_apc_max_hybridOA_2A_1.3, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "51.prev_S2A_1.3_total")


Tableau des prévisions

predictions_all_color_2A_1.3 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Scénario 2.B : apaisement


Un troisième scénario vise à quantifier le montant des APC en cas “d’apaisement”, où la proportion d’articles hybrid diminue tandis que celle des articles gold augmente.

Dans un premier temps, cette évolution est quantifiée de la manière suivante :

  • moyenne des APC

    • l’évolution des APC moyens pour les articles hybrid reste constante (soit -12.6 euros par an) ;
    • l’évolution des APC moyens pour les articles gold reste constante (soit +47.6 euros par an) ;
  • nombre d’articles

    • le nombre d’articles hybrid est multiplié par 0.9 d’une année à l’autre, soit une baisse de 10% (avec 2014 articles hybrid en 2020, il y en aurait alors 1813 en 2021) ;
    • tous les articles qui ne sont pas hybrid sont considérés comme gold.


Dans un second temps, les prévisions sont calculées en faisant varier l’évolution du nombre d’articles hybrid ; non plus une baisse de 10%, mais une baisse de 15% puis de 5%.

Étant donné que les coefficients d’évolution du nombre d’articles d’une année à l’autre sont modifiés, les valeurs maximales et minimales (bornes inférieures et supérieures) ne sont plus disponibles. Nous les définissons à la main, à plus ou moins 5% du coefficient défini. Ainsi, pour une baisse de 10% du nombre d’articles hybrid, les bornes inférieures et supérieures seront 5 et 15%.


Visualisation des prévisions - baisse de 10%

Moyenne APC

plot_color_predictions_2B_0.9 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2B_0.9 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic()+
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.9 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.9 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2B_0.9(mean_APC_gold_2B_0.9, mean_APC_hybridOA_2B_0.9, mean_APC_total, 
                      mean_APC_min_gold_2B_0.9, mean_APC_min_hybridOA_2B_0.9, mean_APC_total_min, 
                      mean_APC_max_gold_2B_0.9, mean_APC_max_hybridOA_2B_0.9, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "52.prev_S2B_0.9_mean")


Nombre d’articles

plot_color_predictions_2B_0.9(nb_articles_gold_2B_0.9, nb_articles_hybridOA_2B_0.9, nb_articles_total, 
                      nb_articles_min_gold_2B_0.9, nb_articles_min_hybridOA_2B_0.9, nb_articles_total_min, 
                      nb_articles_max_gold_2B_0.9, nb_articles_max_hybridOA_2B_0.9, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "53.prev_S2B_0.9_articles")


Montant total

plot_color_predictions_2B_0.9(montant_apc_gold_2B_0.9, montant_apc_hybridOA_2B_0.9, montant_apc_total, 
                      montant_apc_min_gold_2B_0.9, montant_apc_min_hybridOA_2B_0.9, montant_apc_total_min, 
                      montant_apc_max_gold_2B_0.9, montant_apc_max_hybridOA_2B_0.9, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "54.prev_S2B_0.9_total")


Tableau des prévisions

predictions_all_color_2B_0.9 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Visualisation des prévisions - baisse de 15%

Moyenne APC

plot_color_predictions_2B_0.85 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2B_0.85 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.85 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.85 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 800000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2B_0.85(mean_APC_gold_2B_0.85, mean_APC_hybridOA_2B_0.85, mean_APC_total, 
                      mean_APC_min_gold_2B_0.85, mean_APC_min_hybridOA_2B_0.85, mean_APC_total_min, 
                      mean_APC_max_gold_2B_0.85, mean_APC_max_hybridOA_2B_0.85, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "55.prev_S2B_0.85_mean")


Nombre d’articles

plot_color_predictions_2B_0.85(nb_articles_gold_2B_0.85, nb_articles_hybridOA_2B_0.85, nb_articles_total, 
                      nb_articles_min_gold_2B_0.85, nb_articles_min_hybridOA_2B_0.85, nb_articles_total_min, 
                      nb_articles_max_gold_2B_0.85, nb_articles_max_hybridOA_2B_0.85, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "56.prev_S2B_0.85_articles")


Montant total

plot_color_predictions_2B_0.85(montant_apc_gold_2B_0.85, montant_apc_hybridOA_2B_0.85, montant_apc_total, 
                      montant_apc_min_gold_2B_0.85, montant_apc_min_hybridOA_2B_0.85, montant_apc_total_min, 
                      montant_apc_max_gold_2B_0.85, montant_apc_max_hybridOA_2B_0.85, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "57.prev_S2B_0.85_total")


Tableau des prévisions

predictions_all_color_2B_0.85 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Visualisation des prévisions - baisse de 5%

Moyenne APC

plot_color_predictions_2B_0.95 <- function(predictions_gold, predictions_hybridOA, predictions_total,  
                                   predictions_min_gold, predictions_min_hybridOA, predictions_min_total,
                                   predictions_max_gold, predictions_max_hybridOA, predictions_max_total, 
                                   nom_y_label, nom_title,
                                   ymin, ymax, name){
    
    graph <- predictions_all_color_2B_0.95 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = {{predictions_gold}}, color = "gold")) +
        geom_line(aes(y = {{predictions_hybridOA}}, color = "hybridOA")) +
        geom_line(aes(y = {{predictions_total}}, color = "overall")) +
        geom_point(aes(y = {{predictions_total}}), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = {{predictions_min_gold}}, ymax = {{predictions_max_gold}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_hybridOA}}, ymax = {{predictions_max_hybridOA}}), alpha = 0.1) +
        geom_ribbon(aes(ymin = {{predictions_min_total}}, ymax = {{predictions_max_total}}), alpha = 0.1) +
        scale_color_manual(values = c("overall" = "red",
                                      "gold" = "#FFCC00",
                                      "hybridOA" = "#0072B2")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = nom_y_label, title = paste("Évolution", nom_title, "par année et par couleur"), color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(ymin, ymax)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.95 %>% filter(year %% 2 == 0),
                                 aes(y = {{predictions_total}}, label = ifelse({{predictions_total}} < 1000000, 
                                                                         round({{predictions_total}}, 0), 
                                                                         paste("  ", format(round({{predictions_total}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = predictions_all_color_2B_0.95 %>% filter(year == 2030),
                                 aes(y = {{predictions_hybridOA}}, label = ifelse({{predictions_hybridOA}} < 1000000, 
                                                                         round({{predictions_hybridOA}}, 0), 
                                                                         paste(" ", format(round({{predictions_hybridOA}} / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
    saving_plot(name, graph)
    graph

}
plot_color_predictions_2B_0.95(mean_APC_gold_2B_0.95, mean_APC_hybridOA_2B_0.95, mean_APC_total, 
                      mean_APC_min_gold_2B_0.95, mean_APC_min_hybridOA_2B_0.95, mean_APC_total_min, 
                      mean_APC_max_gold_2B_0.95, mean_APC_max_hybridOA_2B_0.95, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "58.prev_S2B_0.95_mean")


Nombre d’articles

plot_color_predictions_2B_0.95(nb_articles_gold_2B_0.95, nb_articles_hybridOA_2B_0.95, nb_articles_total, 
                      nb_articles_min_gold_2B_0.95, nb_articles_min_hybridOA_2B_0.95, nb_articles_total_min, 
                      nb_articles_max_gold_2B_0.95, nb_articles_max_hybridOA_2B_0.95, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, 30000, "59.prev_S2B_0.95_articles")


Montant total

plot_color_predictions_2B_0.95(montant_apc_gold_2B_0.95, montant_apc_hybridOA_2B_0.95, montant_apc_total, 
                      montant_apc_min_gold_2B_0.95, montant_apc_min_hybridOA_2B_0.95, montant_apc_total_min, 
                      montant_apc_max_gold_2B_0.95, montant_apc_max_hybridOA_2B_0.95, montant_apc_total_max, 
                      "Total cost of APCs", "du montant total d'APC payés",
                      0, 72000000, "60.prev_S2B_0.95_total")


Tableau des prévisions

predictions_all_color_2B_0.95 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Simulation 3 : full global OA


Une dernière simulation vise à quantifier le montant des APC dans une hypothèse de full global OA sur le site de l’éditeur.

Le but de cette simulation est de prévoir le coût total des APC si tous les articles ayant un auteur correspondant français payaient des APC. Cette simulation est quantifiée de la manière suivante :

  • moyenne des APC : l’évolution des APC moyens est récupérée du scénario 1, par tier ;

  • nombre d’articles : 90% du nombre d’articles ayant un CA français (et non plus les articles avec un CA français et où un APC a été payé) est prédit par tier (les 90% quantifient une baisse de 10% d’articles diamond).

On part donc des APC moyens prédits en scénario 1 que l’on va multiplier par 90% du nombre d’articles ayant un CA français.


Visualisation des prévisions

Moyenne APC (scénario 1)

plot_tier_predictions(predictions_all_S3,
                      mean_APC_1, mean_APC_2, mean_APC_3, mean_APC_4, mean_APC_total, 
                      mean_APC_min_1, mean_APC_min_2, mean_APC_min_3, mean_APC_min_4, mean_APC_total_min, 
                      mean_APC_max_1, mean_APC_max_2, mean_APC_max_3, mean_APC_max_4, mean_APC_total_max, 
                      "Average APC", "de la moyenne des APC",
                      615, 2800, "61.prev_S3_mean")


Nombre d’articles (extrapolé)

plot_tier_predictions(predictions_all_S3,
                      nb_articles_1, nb_articles_2, nb_articles_3, nb_articles_4, nb_articles_total, 
                      nb_articles_min_1, nb_articles_min_2, nb_articles_min_3, nb_articles_min_4, nb_articles_total_min, 
                      nb_articles_max_1, nb_articles_max_2, nb_articles_max_3, nb_articles_max_4, nb_articles_total_max, 
                      "Number of articles", "du nombre d'articles",
                      0, NA, "62.prev_S3_articles")


Montant total

na_2020 <- cbind(predictions_all_S3 %>% mutate(montant_apc_1 = ifelse(year < 2021, NA, montant_apc_1),
                                             montant_apc_min_1 = ifelse(year < 2021, NA, montant_apc_min_1),
                                             montant_apc_2 = ifelse(year < 2021, NA, montant_apc_2),
                                             montant_apc_min_2 = ifelse(year < 2021, NA, montant_apc_min_2),
                                             montant_apc_3 = ifelse(year < 2021, NA, montant_apc_3),
                                             montant_apc_min_3 = ifelse(year < 2021, NA, montant_apc_min_3),
                                             montant_apc_4 = ifelse(year < 2021, NA, montant_apc_4),
                                             montant_apc_min_4 = ifelse(year < 2021, NA, montant_apc_min_4),
                                             montant_apc_total = ifelse(year < 2021, NA, montant_apc_total),
                                             montant_apc_total_min = ifelse(year < 2021, NA, montant_apc_total_min),
                                             montant_apc_total_max = ifelse(year < 2021, NA, montant_apc_total_max)), 
                 predictions_all %>% rename_at(vars(-year), ~ paste0(., "_S1")) %>% select(44:46))
graph <- na_2020 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = montant_apc_1, color = "tier 1")) +
        geom_line(aes(y = montant_apc_2, color = "tier 2")) +
        geom_line(aes(y = montant_apc_3, color = "tier 3")) +
        geom_line(aes(y = montant_apc_4, color = "tier 4")) +
        geom_ribbon(aes(ymin = montant_apc_min_1, ymax = montant_apc_max_1), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_min_2, ymax = montant_apc_max_2), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_min_3, ymax = montant_apc_max_3), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_min_4, ymax = montant_apc_max_4), alpha = 0.1) +
        geom_line(aes(y = montant_apc_total, color = "overall")) +
        geom_point(aes(y = montant_apc_total), col = "black", size = 1.2) +
        geom_ribbon(aes(ymin = montant_apc_total_min, ymax = montant_apc_total_max), alpha = 0.1) +
        geom_line(aes(y = montant_apc_total_S1, color = "trends continue \nunchanged")) +
        geom_point(aes(y = montant_apc_total_S1), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = montant_apc_total_min_S1, ymax = montant_apc_total_max_S1), alpha = 0.1) +
        scale_color_manual(values = c("tier 1" = "#009E73",
                                      "tier 2" = "#660099",
                                      "tier 3" = "#0072B2",
                                      "tier 4" = "#D55E00",
                                      "overall" = "black",
                                      "trends continue \nunchanged" = "red")) +
        theme_classic()+
        theme(legend.position = "right", axis.title.x = element_blank()) +
        labs(color = "", y = "Total cost of APCs", title = "Évolution du montant total d'APC payés par année", color = " ") +
        scale_y_continuous(labels = scales::comma, limits = c(NA, NA)) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = na_2020 %>% filter(year %% 2 == 0),
                                 aes(y = montant_apc_total, label = ifelse(montant_apc_total < 1000000, 
                                                                         round(montant_apc_total, 0), 
                                                                         paste("  ", format(round(montant_apc_total / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = na_2020 %>% filter(year == 2030),
                                 aes(y = montant_apc_total_S1, label = ifelse(montant_apc_total_S1 < 1000000, 
                                                                         round(montant_apc_total_S1, 0), 
                                                                         paste("  ", format(round(montant_apc_total_S1 / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("63.prev_S3_total", graph)


Tableau des prévisions

na_2020 %>% DT::datatable(options = list(pageLength = 9, searching = F, scrollY = "390px"), fillContainer = T)


Coût des abonnements


Cette analyse vise à prédire les futurs coûts d’abonnements des revues pour la période 2021-2030, en se basant sur les données du consortium Couperin. Celles-ci contiennent une estimation du coût annuel des abonnements aux revues scientifiques entre 2014 et 2020. Les étapes suivantes ont été suivies pour prédire les évolutions futures :

  • on estime la hausse annuelle du prix des abonnements via un modèle ;
  • on part de la dernière valeur connue pour 2020 : 87.5M, à laquelle on somme le coefficient estimé.
# Import Couperin's data
subscription_data <- read_excel("../data/3.external/Couperin/Abonnements_revues_ERE_2014-2021_VF.xlsx", sheet = "Synthèse ERE ", skip = 1) %>% 
    select(1,3) %>% dplyr::rename(year = `Année`, n = `Part abonnements   revues scientifiques*`) %>% filter(year > 2012 & year < 2022) %>% mutate(year = as.numeric(year) - 2014)

        # Growth rate


gr_subscription <- subscription_data %>% mutate(growth = round((n - lag(n))/lag(n)*100, 2)) %>% na.omit()

        # Annual price increase

# Modèle
model <- lm(n ~ year, data = subscription_data)
    # sauvegarde des coefficients estimés
beta <- tidy(model) %>%
    filter(term %in% "year") %>%
    mutate(estimate = estimate,  
           ymin = estimate - 2 * `std.error`, # intervalles à 95%
           ymax = estimate + 2 * `std.error`)

# Prévisions
    # dernière valeur connue : 87.5M en 2020
prev_subscription <- data.frame(year = 2020, n = 87500000) %>% complete(year = 2021:2030) %>% arrange(year) %>%
    mutate(n_min = n,  n_max = n)
    # prévisions 2013-2020
for(i in 2:11){
    prev_subscription[i,2] <- prev_subscription[i-1,2] + beta[,2]
    prev_subscription[i,3] <- prev_subscription[i-1,3] + beta[,6]
    prev_subscription[i,4] <- prev_subscription[i-1,4] + beta[,7]
}
    # visualisation
graph <- prev_subscription %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = n), col = "red") +
        geom_point(aes(y = n), col = "red", size = 1.2) +
        geom_ribbon(aes(ymin = n_min, ymax = n_max), alpha = 0.1) +
        theme_classic() +
        theme(legend.position = "none", axis.title.x = element_blank()) +
        labs(color = "", y = "Subscription expenditures", title = "Évolution du coût des abonnements aux périodiques en euros") +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2020, 2030, 1)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        ggrepel::geom_text_repel(data = prev_subscription %>% filter(year %% 2 == 0),
                                 aes(y = n, label = paste(" ", format(round(n / 1e6, 1), trim = TRUE), "M€")), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("64.prev_subscriptions", graph)

Après avoir calculé le taux de croissance du coût des abonnements d’une année à l’autre, nous voyons que le taux de croissance moyen est de 1.76%, avec un taux minimal de -1.95 entre 2019 et 2020, et un taux maximal de 7.22 entre 2014 et 2015.


Résumé des prévisions


All

# Regroupement des prévisions
resume_prev <- cbind(predictions_all %>% select(1,44:46) %>% rename_at(vars(-year), ~ paste0(., "_1")), 
                     predictions_all_color_2A_1.25 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2A_1.25")),
                     predictions_all_color_2A_1.2 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2A_1.2")),
                     predictions_all_color_2A_1.3 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2A_1.3")),
                     predictions_all_color_2B_0.85 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2B_0.85")),
                     predictions_all_color_2B_0.9 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2B_0.9")),
                     predictions_all_color_2B_0.95 %>% select(1,26:28) %>% rename_at(vars(-year), ~ paste0(., "_2B_0.95")),
                     na_2020 %>% select(1,44:46) %>% rename_at(vars(-year), ~ paste0(., "_S3")),
                     prev_subscription %>% complete(year = 2013:2019) %>% arrange(year) %>% 
                         rename_at(vars(-year), ~ paste0(., "_subscription"))) %>% 
    select(-c(5,9,13,17,21,25,29,33))

# Visualisation
graph <- resume_prev %>% 
    ggplot(aes(x=year)) + 
        geom_line(aes(y = montant_apc_total_2A_1.25, col = "rush scenario (+ 25%)")) +
        geom_point(aes(y = montant_apc_total_2A_1.25), col = "#009E73") +
        #geom_line(aes(y = montant_apc_total_2A_1.2, col = "emballement gold + 20%")) +
        #geom_line(aes(y = montant_apc_total_2A_1.3, col = "emballement gold + 30%")) +
        geom_line(aes(y = montant_apc_total_2B_0.9, col = "relief scenario (- 10%)")) +
        geom_point(aes(y = montant_apc_total_2B_0.9), col = "#0072B2") +
        #geom_line(aes(y = montant_apc_total_2B_0.85, col = "apaisement hybrid - 15%")) +
        #geom_line(aes(y = montant_apc_total_2B_0.95, col = "apaisement hybrid - 5%")) +
        geom_line(aes(y = montant_apc_total_S3, col = "full gold APC")) +
        geom_point(aes(y = montant_apc_total_S3), col = "#FFCC00") +
        geom_line(aes(y = n_subscription, col = "subscription expenditures"), size = .7) +
        geom_point(aes(y = n_subscription), col = "#666666") +
        geom_line(aes(y = montant_apc_total_1, col = "trends continue unchanged"), size = .7) +
        geom_point(aes(y = montant_apc_total_1), col = "red") +
        geom_ribbon(aes(ymin = montant_apc_total_min_1, ymax = montant_apc_total_max_1), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min_2A_1.25, ymax = montant_apc_total_max_2A_1.25), alpha = 0.1) +
        #geom_ribbon(aes(ymin = montant_apc_total_min_2A_1.2, ymax = montant_apc_total_max_2A_1.2), alpha = 0.1) +
        #geom_ribbon(aes(ymin = montant_apc_total_min_2A_1.3, ymax = montant_apc_total_max_2A_1.3), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min_2B_0.9, ymax = montant_apc_total_max_2B_0.9), alpha = 0.1) +
        #geom_ribbon(aes(ymin = montant_apc_total_min_2B_0.85, ymax = montant_apc_total_max_2B_0.85), alpha = 0.1) +
        #geom_ribbon(aes(ymin = montant_apc_total_min_2B_0.95, ymax = montant_apc_total_max_2B_0.95), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min_S3, ymax = montant_apc_total_max_S3), alpha = 0.1) +
        geom_ribbon(aes(ymin = n_min_subscription, ymax = n_max_subscription), alpha = 0.1) +
        scale_color_manual(values = c("full gold APC" = "#FFCC00",
                                      "subscription expenditures" = "#666666",
                                      "rush scenario (+ 25%)" = "#009E73",
                                      #"emballement gold + 20%" = "#f602ed",
                                      #"emballement gold + 30%" = "#9902f6",
                                      "relief scenario (- 10%)" = "#0072B2",
                                      #"apaisement hybrid - 15%" = "#004166",
                                      #"apaisement hybrid - 5%" = "#0093e5",
                                      "trends continue unchanged" = "red"
                                      )) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(),
              legend.text = element_text(size = 10),
              axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))) +
        labs(color = "", y = "Total cost", title = paste("Prévisions du montant total d'APC payés selon le scénario")) +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        guides(colour = guide_legend(title = "")) +
        ggrepel::geom_text_repel(data = resume_prev %>% filter(year %% 3 == 1),
                                 aes(y = montant_apc_total_1, label = ifelse(montant_apc_total_1 < 1000000, 
                                                                         round(montant_apc_total_1, 0), 
                                                                         paste("  ", format(round(montant_apc_total_1 / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = resume_prev %>% filter(year == 2030),
                                 aes(y = montant_apc_total_S3, label = ifelse(montant_apc_total_S3 < 1000000, 
                                                                         round(montant_apc_total_S3, 0), 
                                                                         paste("  ", format(round(montant_apc_total_S3 / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = resume_prev %>% filter(year == 2030),
                                 aes(y = n_subscription, label = ifelse(n_subscription < 1000000, 
                                                                         round(n_subscription, 0), 
                                                                         paste("  ", format(round(n_subscription / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
ggsave(file = "../figures/SVG_modelisations/65.prev_resume.svg", plot=graph, width=12, height=6)


Faceted

# Regroupement des prévisions
resume_prev2 <- rbind(predictions_all %>% select(1,44:46) %>% mutate(prevision = "trends continue unchanged"), 
                     predictions_all_color_2A_1.25 %>% select(1,26:28) %>% mutate(prevision = "rush scenario (+ 25%)"),
                     predictions_all_color_2A_1.2 %>% select(1,26:28) %>% mutate(prevision = "rush scenario (+ 20%)"),
                     predictions_all_color_2A_1.3 %>% select(1,26:28) %>% mutate(prevision = "rush scenario (+ 30%)"),
                     predictions_all_color_2B_0.9 %>% select(1,26:28) %>% mutate(prevision = "relief scenario (- 10%)"),
                     predictions_all_color_2B_0.85 %>% select(1,26:28) %>% mutate(prevision = "relief scenario (- 15%)"),
                     predictions_all_color_2B_0.95 %>% select(1,26:28) %>% mutate(prevision = "relief scenario (- 5%)"),
                     na_2020 %>% select(1,44:46) %>% mutate(prevision = "full gold APC"),
                     prev_subscription %>% complete(year = 2013:2019) %>% arrange(year) %>% 
                         rename(montant_apc_total = 2, montant_apc_total_min = 3, montant_apc_total_max = 4) %>% 
                         mutate(prevision = "subscription expenditures"))

resume_prev2$prevision <- factor(resume_prev2$prevision, ordered = TRUE, levels = c("rush scenario (+ 25%)",
                                                                                    "rush scenario (+ 20%)",
                                                                                    "rush scenario (+ 30%)",
                                                                                    "relief scenario (- 10%)",
                                                                                    "relief scenario (- 15%)",
                                                                                    "relief scenario (- 5%)",
                                                                                    "full gold APC",
                                                                                    "subscription expenditures",
                                                                                    "trends continue unchanged"))

# Visualisation
graph <- resume_prev2 %>% 
    ggplot(aes(x = year, y = montant_apc_total, group = prevision, color = prevision)) + 
        geom_line() +
        geom_point(size = 0.4) +
        facet_wrap(.~prevision, ncol = 3) +
        geom_ribbon(aes(ymin = montant_apc_total_min, ymax = montant_apc_total_max), alpha = 0.1, colour = NA) +
        scale_color_manual(values = c("#009E73", "#00d198", "#006b4e", "#0072B2", "#29bdec", "#004166", "#FFCC00", "#666666", "red")) +
        labs(color = "", y = "Total cost", title = paste("Prévisions du montant total d'APC payés selon le scénario")) +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2014, 2030, 4)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        theme_classic() +
        theme(legend.position = "none", panel.spacing = unit(2, "lines"),
              axis.title.x = element_blank(),
              axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))) +
        ggrepel::geom_text_repel(data = resume_prev2 %>% filter(year == 2030),
                                 aes(y = montant_apc_total, label = ifelse(montant_apc_total < 1000000, 
                                                                         round(montant_apc_total, 0), 
                                                                         paste("  ", format(round(montant_apc_total / 1e6, 2), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("66.prev_resume_facet", graph)

Scénarios 1, 2A et 2B

# Visualisation
graph <- resume_prev %>% 
    ggplot(aes(x=year)) + 
        geom_line(aes(y = montant_apc_total_2A_1.25, col = "rush scenario + 25%")) +
        geom_point(aes(y = montant_apc_total_2A_1.25), col = "#009E73") +
        geom_line(aes(y = montant_apc_total_2B_0.9, col = "relief scenario - 10%")) +
        geom_point(aes(y = montant_apc_total_2B_0.9), col = "#0072B2") +
        geom_line(aes(y = montant_apc_total_1, col = "observed increase \ncontinues"), size = .7) +
        geom_point(aes(y = montant_apc_total_1), col = "red") +
        geom_ribbon(aes(ymin = montant_apc_total_min_1, ymax = montant_apc_total_max_1), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min_2A_1.25, ymax = montant_apc_total_max_2A_1.25), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min_2B_0.9, ymax = montant_apc_total_max_2B_0.9), alpha = 0.1) +
        scale_color_manual(values = c("rush scenario + 25%" = "#009E73",
                                      "relief scenario - 10%" = "#0072B2",
                                      "observed increase \ncontinues" = "red"
                                      )) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(),
              legend.text = element_text(size = 10),
              axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))) +
        labs(color = "", y = "Total cost of APCs", title = paste("Prévisions du montant total d'APC payés selon le scénario")) +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2014, 2030, 2)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        guides(colour = guide_legend(title = "")) +
        ggrepel::geom_text_repel(data = resume_prev %>% filter(year %% 3 == 1),
                                 aes(y = montant_apc_total_1, label = ifelse(montant_apc_total_1 < 1000000, 
                                                                         round(montant_apc_total_1, 0), 
                                                                         paste("  ", format(round(montant_apc_total_1 / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = resume_prev %>% filter(year == 2030),
                                 aes(y = montant_apc_total_2A_1.25, label = ifelse(montant_apc_total_2A_1.25 < 1000000, 
                                                                         round(montant_apc_total_2A_1.25, 0), 
                                                                         paste("  ", format(round(montant_apc_total_2A_1.25 / 1e6, 1), trim = TRUE), "M€"))), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
ggsave(file = "../figures/SVG_modelisations/67.prev_resume_1_2A_2B.svg", plot=graph, width=12, height=6)

Simulation 3 et abonnements

# Comparaison des prévisions des abonnements à la simulation 3
subscription_sim3 <- left_join(prev_subscription, na_2020 %>% select(year, montant_apc_total, montant_apc_total_min, montant_apc_total_max), by = "year")

graph <- subscription_sim3 %>% ggplot(aes(x=year)) + 
        geom_line(aes(y = n, col = "subscription expenditures")) +
        geom_line(aes(y = montant_apc_total, col = "full gold APC")) +
        geom_point(aes(y = n), col = "#666666", size = 1.2) +
        geom_point(aes(y = montant_apc_total), col = "#FFCC00", size = 1.2) +
        geom_ribbon(aes(ymin = n_min, ymax = n_max), alpha = 0.1) +
        geom_ribbon(aes(ymin = montant_apc_total_min, ymax = montant_apc_total_max), alpha = 0.1) +
        scale_color_manual(values = c("full gold APC" = "#FFCC00", "subscription expenditures" = "#666666")) +
        theme_classic() +
        theme(legend.position = "right", legend.text = element_text(size = 10), axis.title.x = element_blank()) +
        labs(color = "", y = "Total cost", title = "Évolution du coût des abonnements et du coût des APC dans une \nsimulation full global OA") +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2020, 2030, 1)) +
        geom_vline(xintercept = 2020, linetype = "dashed", color = "black", size = .5) +
        guides(colour = guide_legend(title = "")) +
        ggrepel::geom_text_repel(data = subscription_sim3 %>% filter(year %% 2 == 0),
                                 aes(y = n, label = paste("  ", format(round(n / 1e6, 1), trim = TRUE), "M€")), 
                                 color = '#333333', vjust=.35, nudge_x = .41, size=3.5, min.segment.length = 0.2, size = 3) +
        ggrepel::geom_text_repel(data = subscription_sim3 %>% filter(year %% 2 == 0),
                                 aes(y = montant_apc_total, label = paste(" ", format(round(montant_apc_total / 1e6, 1), trim = TRUE), "M€")), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("68.prev_resume_3_subscriptions", graph)



V - Analyses complémentaires


Graphiques du nombre d’articles avec french CA

Couleur OA

# OA articles at the site of the publisher with CA=FR

# Table
apc_paid <- data %>% filter(is_french_CA_bso_wos_openalex_single_lang == 1) %>% group_by(year, oa_color_finale) %>% 
    filter(oa_color_finale != "other") %>% 
    summarise(n = n())
apc_paid <- pivot_wider(apc_paid, names_from = oa_color_finale, values_from = n) %>% mutate(apc_paid = gold + hybridOA) %>% 
    dplyr::rename(`APC paid (Gold \nor HybridOA)` = apc_paid,
           Gold = gold,
           HybridOA = hybridOA,
           Diamond = diamond) %>% 
    pivot_longer(cols = c(Diamond, Gold, HybridOA, `APC paid (Gold \nor HybridOA)`)) %>% 
    mutate(name = fct_reorder(name, value, tail, n = 1, .desc = T))

# Plot
graph <- apc_paid %>% ggplot(aes(x = year, y = value, group = name, color = name)) + 
        geom_line(size = 1) +
        scale_color_manual(values = c("#339900", "#FFCC00", "#CC0099", "#0072b2")) +
        labs(y = "Number of articles", title = "OA articles at the site of the publisher/platform", subtitle = "Articles with a french CA ", color = " ") +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(), 
              plot.subtitle = element_text(face = "italic", color = "#333333")) +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2013, 2020, 1)) +
        ggrepel::geom_text_repel(data = apc_paid,
                                 aes(y = value, label = value), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("69.redo_color", graph)


Discipline

+ Toutes les disciplines

# Evolution of articles per discipline [CA-FR]

# Table
discipline <- data %>% filter(is_french_CA_bso_wos_openalex_single_lang == 1 & apc_has_been_paid == 1 & !is.na(amount_apc_EUR_trusted)) %>% group_by(year, bso_classification) %>% 
    filter(bso_classification != "unknown") %>% 
    summarise(n = n())

# Plot
graph <- discipline %>% 
    ggplot(aes(x = year, y = n, group = bso_classification, color = fct_reorder2(bso_classification, year, n))) + 
        geom_line(size = 1) +
        scale_color_manual(values = c("#9d6939", "#025bbb", "#f1e401", "#284803", "#3aaf29", "#fb6d35", "#852895", "#24c0e8", "#fcb2b8", "#919598")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(), 
              strip.background = element_rect(fill = "grey", colour = "#999999", linetype = 2),
              plot.subtitle = element_text(face = "italic", color = "#333333")) +
        labs(y = "Number of articles", 
             title = "Evolution of articles per discipline", 
             subtitle = "Articles with a french CA, where an APC has been paid and the amount is trusted", 
             color = " ") +
        scale_y_continuous(labels = scales::comma, breaks = seq(0, 6000, 1000)) +
        scale_x_continuous(breaks=seq(2013, 2020, 1))
graph
saving_plot("70.redo_discipline", graph)

+ Zoom sur les disciplines avec moins de 1500 articles

graph <- discipline %>% filter(n < 1500) %>% 
    ggplot(aes(x = year, y = n, group = bso_classification, color = fct_reorder2(bso_classification, year, n))) + 
        geom_line(size = 1) +
        scale_color_manual(values = c("#9d6939", "#025bbb", "#f1e401", "#284803", "#3aaf29", "#fb6d35", "#852895", "#24c0e8", "#fcb2b8", "#919598")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(), 
              strip.background = element_rect(fill = "grey", colour = "#999999", linetype = 2),
              plot.subtitle = element_text(face = "italic", color = "#333333")) +
        labs(y = "Number of articles", 
             title = "Evolution of articles per discipline zoomed", 
             subtitle = "Articles with a french CA, where an APC has been paid and the amount is trusted", 
             color = " ") +
        scale_y_continuous(labels = scales::comma, breaks = seq(0, 6000, 1000)) +
        scale_x_continuous(breaks=seq(2013, 2020, 1))
graph
saving_plot("70.1.redo_discipline_zommed", graph)

+ Répartition du nombre d’articles avec des APC payés sur les articles avec un CA français, par discipline

# Evolution of articles per discipline [CA-FR]

# Table
discipline_frenchCA <- data %>% filter(is_french_CA_bso_wos_openalex_single_lang == 1) %>% 
    group_by(bso_classification) %>% 
    filter(bso_classification != "unknown") %>% 
    summarise(n_frenchCA = n())
discipline_APCpaid <- data %>% filter(is_french_CA_bso_wos_openalex_single_lang == 1 & apc_has_been_paid == 1 & !is.na(amount_apc_EUR_trusted)) %>% 
    group_by(bso_classification) %>% 
    filter(bso_classification != "unknown") %>% 
    summarise(n_APCpaid = n())
discipline <- left_join(discipline_frenchCA, discipline_APCpaid, by = "bso_classification") %>% mutate(part = paste(round(n_APCpaid / n_frenchCA * 100, 0), "%", sep = ""))

# Plot
graph <- discipline %>% #mutate(bso_classification = fct_reorder2(bso_classification, n_frenchCA)) %>% 
    ggplot(aes(x = reorder(bso_classification, n_frenchCA))) + 
        geom_bar(aes(y = n_frenchCA), fill = "#599cc2", col = "white", stat = "identity") +
        geom_bar(aes(y = n_APCpaid), fill = "#0072b2", col = "white", stat = "identity") +
        theme_classic() +
        theme(legend.position = "right", axis.title.y = element_blank(), 
              strip.background = element_rect(fill = "grey", colour = "#999999", linetype = 2),
              plot.subtitle = element_text(face = "italic", color = "#333333")) +
        labs(y = "Number of articles", 
             title = "Repartition of the number of articles per discipline", 
             subtitle = "Articles where an APC has been paid and the amount is trusted, 
over total number of articles with a french CA", 
             color = " ") +
        coord_flip() +
        geom_text(aes(x = bso_classification, y = n_APCpaid + 10000, label = part), color = "white")
graph
saving_plot("70.2.discipline_repartition", graph)


Tier

# Gold or HybridOA articles per publisher tier [CA=FR]

# Table
tier <- data %>% filter(is_french_CA_bso_wos_openalex_single_lang == 1 & apc_has_been_paid == 1 & !is.na(amount_apc_EUR_trusted)) %>% group_by(year, tier) %>% 
    mutate(tier = as.factor(paste("tier", tier, sep = " "))) %>% 
    summarise(n = n())

# Plot
graph <- tier %>% 
    ggplot(aes(x = year, y = n, group = tier, color = tier)) + 
        geom_line(size = 1) +
        scale_color_manual(values = c("#009E73", "#660099", "#0072B2", "#D55E00")) +
        theme_classic() +
        theme(legend.position = "right", axis.title.x = element_blank(), 
              plot.subtitle = element_text(face = "italic", color = "#333333")) +
        labs(y = "Number of articles", 
             title = "Gold or HybridOA articles per publisher tier", 
             subtitle = "Articles with a french CA, where an APC has been paid and the amount is trusted",
             color = " ") +
        scale_y_continuous(labels = scales::comma) +
        scale_x_continuous(breaks=seq(2013, 2020, 1)) +
        ggrepel::geom_text_repel(data = tier,
                                 aes(y = n, label = n), 
                                 color = '#333333', hjust=-.25, nudge_y = 5, nudge_x = -.1, size=3.5, min.segment.length = 0.2, size = 3)
graph
saving_plot("71.redo_tier", graph)


Analyse des APC pour les articles french et non french CA

Les graphiques de la première partie, visualisant le montant moyen des APC pour les articles ayant un CA français ou non, montrent qu’en moyenne, le montant des APC est supérieur pour les articles où le CA est français. Cette section vise donc à comprendre les facteurs, en décomposant les articles en question par couleur, discipline etc. Pour cela nous comparons les moyennes d’APC des 2 sous-populations (french et non french CA) pour les autres variables catégorielles. L’analyse est réalisée sur les articles concernés par des APC et dont les montants sont fiables (apc_has_been_paid == 1 & !is.na(amount_apc_EUR_trusted)).

Couleur OA

# Calcul des moyennes par groupe pour les 2 sous-populations puis visualisation
analyse_amount <- function(group, xlab, title, ncol){
    
    data %>% 
        filter(apc_has_been_paid == 1 & !is.na(amount_apc_EUR_trusted)) %>% 
        group_by(is_french_CA_bso_wos_openalex_single_lang, year, {{group}}) %>% 
        summarise(mean = mean(amount_apc_EUR_trusted, na.rm = T)) %>% 
        na.omit() %>% 
        mutate(is_french_CA = as.factor(is_french_CA_bso_wos_openalex_single_lang)) %>% 
        ggplot(aes(x = {{group}}, y = mean, group = is_french_CA, fill = is_french_CA)) + 
        geom_bar(stat="identity", position = "dodge", width=.6, col = "white", size = 2) +
        facet_wrap(~year, ncol = ncol) +
        theme_classic() +
        labs(color = "", x = xlab, y = "Average APC", title = paste("Comparaison des APC moyens pour french et non french CA articles, \nselon", title)) +
        coord_flip() +
        scale_fill_discrete(breaks=c('1', '0'), direction = -1)

}
analyse_amount(oa_color_finale, "Couleur OA", "la couleur OA", 4)

La moyenne des APC des articles hybrid est quasiment toujours inférieure pour les articles au CA français que ceux au CA non français. Tandis que la moyenne est relativement la même pour les articles gold, la différence notable pour les articles hybrid peut être une explication aux APC moyens différents observés en analyse exploratoire (partie 1).

Tier

analyse_amount(tier, "Tier", "le tier", 4)

Ce graphique met en avant les différence d’APC moyen par tier, selon si le CA est français ou non. On voit d’emblée que pour les tiers 1, 2 et 3 (1 principalement), le montant moyen est inférieur pour les articles ayant un CA français. A contrario, pour les articles du tier 4, en moyenne le montant des APC est inférieur pour les articles au CA non français.

Discipline

analyse_amount(bso_classification, "Discipline", "la discipline", 4)

On voit que les APC des articles des disciplines “Physical sciences, astronomy”, “Medical research”, “Humanities” et “Chemistry” sont plus élevés en moyenne lorsque le CA n’est pas français.

Couperin

data <- data %>% mutate(is_covered_by_couperin = as.factor(is_covered_by_couperin))
analyse_amount(is_covered_by_couperin, "Couvert par Couperin", "qu'il soit couvert par Couperin", 4)

On ne voit pas de grande distinction pour les APC moyens des articles couverts et non couverts par les accords Couperin. Pour les articles couvert par les accords, les APC moyens sont légèrement supérieur lorsque l’auteur correspondant n’est pas français, mais la différence reste minime.

Les graphiques ont donc montrés que les APC moyens sont inférieurs lorsque le CA est français pour :

  • la couleur “hybrid” ;
  • le tier “1” et dans une moindre mesure les tiers “2” et “3” ;
  • la discipline “Humanities” et dans une moindre mesure les disciplines “Physical sciences, astronomy”, “Medical research”, “Chemistry” ;
  • la couverture par les accords Couperin (légère différence).
 

Document sous licence ouverte